Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.
In decision theory, we often talk about programs that know their own source code. I'm very confused about how that theory applies to people, or even to computer programs that don't happen to know their own source code. I've managed to distill my confusion into three short questions:
1) Am I uncertain about my own source code?
2) If yes, what kind of uncertainty is that? Logical, indexical, or something else?
3) What is the mathematically correct way for me to handle such uncertainty?
Don't try to answer them all at once! I'll be glad to see even a 10% answer to one question.
I recently stumbled upon an article from early 2003 in Physics World outlining a bit of evidence that some of the constants in nature may change over time. In this particular case, researchers studying quasars noticed that the fine-structure constant (α) might have fluctuated a bit billions of years ago, in both directions (bigger and smaller) with significance 4.1 sigma. What intrigues me about this is that I’ve previously pondered if something like this might be found, albeit for very different reasons.
Back in the 90s I read a book that made a case for the universe as a computer simulation. That particular book wasn’t all that compelling to me, but I’ve never been completely satisfied with arguments against that model and tend to think of the universe generally in those terms anyway. Can I still call myself an atheist if I allow the possibility of a creator in this context? A non-practicing atheist maybe?
If this universe is a computer-generated simulation, programmed by another life form, perhaps the search for extraterrestrial intelligence (SETI) should be expanded to include life forms beyond our universe. It sounds nonsensical, but is it?
If I were to design and code an environment sophisticated enough to allow a species of life to evolve in that environment, I am not convinced that I would have many tools at my disposal to truly be able to understand and evaluate that species very well. Sure, I may be able to see them generating patterns that indicate intelligent life within my simulation, but this life form evolved and exists in an environment completely alien to me. I might have only limited methods at my disposal through which to communicate with them. They would exist in a place that to me is not exactly real and vice-versa.
I’ve always imagined it would be more like evaluating patterns and data readouts, or viewing cells through a microscope, than something like The Sims. Having designed and implemented the very laws of their universe though, the fundamental constants of the universe could act as a sort of communication channel – one that allows me to at the very least let them know I existed (assuming they were intelligent and were looking). I could modify those constants in such a way over time in much the same manner that we might try to communicate with the more local and familiar concept of alien.
I realize this is all just rambling, but because the alpha is so closely related to those parts of nature that allow for our own existence, it made me take notice, and wonder if this could be some sort of alpha mail. The thought of being able to communicate with an external intelligence is thought provoking enough for me that I decided to write this as my first post here. Who knows? If it ever was confirmed, perhaps we could turn out to be the paper clip maximizer, and we should start looking for our ticket out of here.
This is a thread for rationality-related or LW-related jokes and humor. Please post jokes (new or old) in the comments.
Q: Why are Chromebooks good Bayesians?
A: Because they frequently update!
A super-intelligent AI walks out of a box...
Q: Why did the psychopathic utilitarian push a fat man in front of a trolley?
A: Just for fun.
The official story: "Fifty Shades of Grey" was a Twilight fan-fiction that had over two million downloads online. The publishing giant Vintage Press saw that number and realized there was a huge, previously-unrealized demand for stories like this. They filed off the Twilight serial numbers, put it in print, marketed it like hell, and now it's sold 60 million copies.
The reality is quite different.
I'd like to gauge interest in an (English-language) Tokyo area meetup - given Tokyo's size, if a couple people are interested, it would be good to pick a location/day that's convenient for everybody. Otherwise I will announce a date and time and wait in a cafe with a book hoping that somebody will turn up.
I have been to several LW gatherings and have met consistently awesome and nice people, so if any Tokyo lurkers are reading this, I can assure you it's totally worth it to come! Please make yourself heard in the comments if you are interested.
I don't know very much model theory, and thus I don't fully understand Hutter et al.'s logical prior, detailed here, but nonetheless I can tell you that it uses a very top-down approach. About 60% of what I mean is that the prior is presented as a completed object with few moving parts, which fits the authors' mathematical tastes and proposed abstract properties the function should have. And for another thing, it uses model theory - a dead giveaway.
There are plenty of reasons to take a top-down approach. Yes, Hutter et al.'s function isn't computable, but sometimes the properties you want require uncomputability. And it's easier to come up with something vaguely satisfactory if you don't have to have many moving parts. This can range from "the prior is defined as a thing that fulfills the properties I want" on the lawful good side of the spectrum, to "clearly the right answer is just the exponential of the negative complexity of the statement, duh".
Probably the best reason to use a top-down approach to logical uncertainty is so you can do math to it. When you have some elegant description of global properties, it's a lot easier to prove that your logical probability function has nice properties, or to use it in abstract proofs. Hence why model theory is a dead giveaway.
There's one other advantage to designing a logical prior from the top down, which is that you can insert useful stuff like a complexity penalty without worrying too much. After all, you're basically making it up as you go anyhow; you don't have to worry about where it comes from like you would if you were going from the bottom up.
A bottom-up approach, by contrast, starts with an imagined agent with some state of information and asks what the right probabilities to assign are. Rather than pursuing mathematical elegance, you'll see a lot of comparisons to what humans do when reasoning through similar problems, and demands for computability from the outset.
For me, a big opportunity of the bottom-up approach is to use desiderata that look like principles of reasoning. This leads to more moving parts, but also outlaws some global properties that don't have very compelling reasons behind them.
Before we get to the similarities, rather than the differences, we'll have to impose the condition of limited computational resources. A common playing field, as it were. It would probably serve just as well to extend bottom-up approaches to uncomputable heights, but I am the author here, and I happen to be biased towards the limited-resources case.
The part of top-down assignment using limited resources will be played by a skeletonized pastiche of Paul Christiano's recent report:
i. No matter what, with limited resources we can only assign probabilities to a limited pool of statements. Accordingly, step one is to use some process to choose the set S0 of statements (and their negations) to assign probabilities.
ii. Then we use something like a weakened consistency condition (that can be decided between pairs of sentences in polynomial time) to set constraints on the probability function over S0. For example, sentences that are identical except for a double-negation have to be given the same probability.
iii. Christiano constructs a description-length-based "pre-prior" function that is bigger for shorter sentences. There are lots of options for different pre-priors, and I think this is a pretty good one.
iv. Finally, assign a logical probability function over S0 that is as similar as possible to the pre-prior while fulfilling the consistency condition. Christiano measures similarity using cross-entropy between the two functions, so that the problem is one of minimizing cross-entropy subject to a finite list of constraints. (Even if the pre-prior decreases exponentially, this doesn't mean that complicated statements will have exponentially low logical probability, because of the condition from step two that P(a statement) + P(its negation) = 1 - in a state of ignorance, everything still gets probability 1/2. The pre-prior only kicks in when there are more options with different description lengths.)
Next, let's look at the totally different world of a bottom-up assignment of logical probabilities, played here by a mildly rephrased version of my past proposal.
i. Pick a set of sentences S1 to try and figure out the logical probabilities of.
ii. Prove the truth or falsity of a bunch of statements in the closure of S1 under conjunction and negation (i.e. if sentences a and b are in S1, a&b is in the closure of S1).
iii. Assign a logical probability function over the closure of S1 under conjunction with maximum entropy, subject to the constraints proved in part two, plus the constraints that each sentence & its negation has probability 0.
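To make the three steps above concrete, here is a minimal Python sketch under a simplifying assumption that is not spelled out in the proposal itself: the sentences in S1 are treated as independent atoms, so a probability function over the closure under conjunction is the same thing as a distribution over truth assignments. When every proved constraint just rules truth assignments in or out, the maximum-entropy distribution is uniform over the assignments that survive. The function name, the example sentences, and the "proved" facts are all hypothetical.

```python
from itertools import product

def max_entropy_probabilities(sentences, proved_constraints):
    """Uniform distribution over truth assignments consistent with what was proved.

    Each truth assignment gives every sentence exactly one truth value, so
    "a sentence & its negation has probability 0" holds automatically.
    """
    worlds = [dict(zip(sentences, values))
              for values in product([True, False], repeat=len(sentences))]
    allowed = [w for w in worlds if all(check(w) for check in proved_constraints)]
    if not allowed:
        raise ValueError("the proved constraints are jointly unsatisfiable")
    return {s: sum(w[s] for w in allowed) / len(allowed) for s in sentences}

# Hypothetical example: S1 = {a, b, c}, and step two proved
# "not (a & b)" and "a or c". Everything else is a gap.
sentences = ["a", "b", "c"]
proved = [
    lambda w: not (w["a"] and w["b"]),
    lambda w: w["a"] or w["c"],
]
probs = max_entropy_probabilities(sentences, proved)
```

With nothing proved at all, every sentence comes out at 1/2 in this sketch; each proved fact only shifts probability by removing truth assignments. Enumerating all assignments is exponential in the size of S1, so this is an illustration rather than a limited-resources implementation.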
These turn out to be really similar! Look in step three of my bottom-up example - there's even a sneakily-inserted top-down condition about going through every single statement and checking an aspect of consistency. In the top-down approach, every theorem of a certain sort is proved, while in the bottom-up approach there are allowed to be lots of gaps - but the same sorts of theorems are proved. I've portrayed one as using proofs only about sentences in S0, and the other as using proofs in the entire closure of S1 under conjunction, but those are just points on an available continuum (for more discussion, see Christiano's section on positive semidefinite methods).
The biggest difference is this "pre-prior" thing. On the one hand, it's essential for giving us guarantees about inductive learning. On the other hand, what piece of information do we have that tells us that longer sentences really are less likely? I have unresolved reservations, despite the practical advantages.
A minor confession - my choice of Christiano's report was not coincidental at all. The causal structure went like this:
Last week - Notice dramatic similarities in what gets proved and how it gets used between my bottom-up proposal and Christiano's top-down proposal.
Now - Write post talking about generalities of top-down and bottom-up approaches to logical probability, and then find as a startling conclusion the thing that motivated me to write the post in the first place.
The teeensy bit of selection bias here means that though these similarities are cool, it's hard to draw general conclusions.
So let's look at one more proposal, this one due to Abram Demski, modified to use limited resources.
i. Pick a set of sentences S2 to care about.
ii. Construct a function on sentences in S2 that is big for short sentences and small for long sentences.
iii. Start with the set of sentences that are axioms - we'll shortly add new sentences to the set.
iv. Draw a sentence from S2 with probability proportional to the function from step two.
v. Do a short consistency check (can use a weakened consistency condition, or just limited time) between this sentence and the sentences already in the set. If it's passed, add the sentence to the set.
vi. Keep doing steps four and five until you've either added or ruled out all the sentences in S2.
vii. The logical probability of a sentence is defined as the probability that it ends up in our set after going through this process. We can find this probability using Monte Carlo by just running the process a bunch of times and counting up what portion of the time each sentence is in the set by the end.
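The seven steps above can be sketched as a small Monte Carlo simulation. This is only an illustration under stated assumptions: the function names, the weights, and the pairwise consistency check are hypothetical stand-ins, and drawing without replacement from the undecided sentences is used as a shortcut for "keep drawing until every sentence has been added or ruled out".

```python
import random

def sample_consistent_set(weights, consistent, axioms, rng):
    """One run of steps iii-vi: returns the set of accepted sentences."""
    accepted = set(axioms)                      # step iii: start from the axioms
    remaining = [s for s in weights if s not in accepted]
    while remaining:                            # step vi: until all are decided
        # Step iv: draw a sentence with probability proportional to its weight.
        pick = rng.choices(remaining, weights=[weights[s] for s in remaining])[0]
        remaining.remove(pick)
        # Step v: cheap pairwise consistency check against the current set.
        if all(consistent(pick, other) for other in accepted):
            accepted.add(pick)
    return accepted

def estimate_probabilities(weights, consistent, axioms=(), runs=20000, seed=0):
    """Step vii: fraction of runs in which each sentence ends up in the set."""
    rng = random.Random(seed)
    counts = {s: 0 for s in weights}
    for _ in range(runs):
        for s in sample_consistent_set(weights, consistent, axioms, rng):
            if s in counts:
                counts[s] += 1
    return {s: counts[s] / runs for s in weights}

# Hypothetical S2: three mutually exclusive sentences, with shorter
# sentences given larger weight (step ii).
weights = {"p": 4.0, "q": 2.0, "r": 1.0}

def mutually_exclusive(x, y):
    # Any two distinct sentences contradict each other.
    return x == y

estimates = estimate_probabilities(weights, mutually_exclusive)
```

In this toy case whichever sentence is drawn first is accepted and the other two are rejected, so the estimates should land near the normalized weights 4/7, 2/7 and 1/7.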
Okay, so this one looks pretty different. But let's look for the similarities. The exact same kinds of things get proved again - weakened or scattershot consistency checks between different sentences. If all you have in S2 are three mutually exclusive and exhaustive sentences, the one that's picked first wins - meaning that the probability function over what sentence gets picked first is acting like our pre-prior.
So even though the method is completely different, what's really going on is that sentences are being given measure that looks like the pre-prior, subject to the constraints of weakened consistency (via rejection sampling) and normalization (keep repeating until all statements are checked).
In conclusion: not everything is like everything else, but some things are like some other things.
Summary: I don't think 'politics is the mind-killer' works well rhetorically. I suggest 'politics is hard mode' instead.
My usual first objection is that it seems odd to single politics out as a “mind-killer” when there’s plenty of evidence that tribalism happens everywhere. Recently, there has been a whole kerfuffle within the field of psychology about replication of studies. Of course, some key studies have failed to replicate, leading to accusations of “bullying” and “witch-hunts” and what have you. Some of the people involved have since walked their language back, but it was still a rather concerning demonstration of mind-killing in action. People took “sides,” people became upset at people based on their “sides” rather than their actual opinions or behavior, and so on.
Unless this article refers specifically to electoral politics and Democrats and Republicans and things (not clear from the wording), “politics” is such a frightfully broad category of human experience that writing it off entirely as a mind-killer that cannot be discussed or else all rationality flies out the window effectively prohibits a large number of important issues from being discussed, by the very people who can, in theory, be counted upon to discuss them better than most. Is it “politics” for me to talk about my experience as a woman in gatherings that are predominantly composed of men? Many would say it is. But I’m sure that these groups of men stand to gain from hearing about my experiences, since some of them are concerned that so few women attend their events.
In this article, Eliezer notes, “Politics is an important domain to which we should individually apply our rationality — but it’s a terrible domain in which to learn rationality, or discuss rationality, unless all the discussants are already rational.” But that means that we all have to individually, privately apply rationality to politics without consulting anyone who can help us do this well. After all, there is no such thing as a discussant who is “rational”; there is a reason the website is called “Less Wrong” rather than “Not At All Wrong” or “Always 100% Right.” Assuming that we are all trying to be more rational, there is nobody better to discuss politics with than each other.
The rest of my objection to this meme has little to do with this article, which I think raises lots of great points, and more to do with the response that I’ve seen to it — an eye-rolling, condescending dismissal of politics itself and of anyone who cares about it. Of course, I’m totally fine if a given person isn’t interested in politics and doesn’t want to discuss it, but then they should say, “I’m not interested in this and would rather not discuss it,” or “I don’t think I can be rational in this discussion so I’d rather avoid it,” rather than sneeringly reminding me “You know, politics is the mind-killer,” as though I am an errant child. I’m well-aware of the dangers of politics to good thinking. I am also aware of the benefits of good thinking to politics. So I’ve decided to accept the risk and to try to apply good thinking there. [...]
I’m sure there are also people who disagree with the article itself, but I don’t think I know those people personally. And to add a political dimension (heh), it’s relevant that most non-LW people (like me) initially encounter “politics is the mind-killer” being thrown out in comment threads, not through reading the original article. My opinion of the concept improved a lot once I read the article.
In the same thread, Andrew Mahone added, “Using it in that sneering way, Miri, seems just like a faux-rationalist version of ‘Oh, I don’t bother with politics.’ It’s just another way of looking down on any concerns larger than oneself as somehow dirty, only now, you know, rationalist dirty.” To which Miri replied: “Yeah, and what’s weird is that that really doesn’t seem to be Eliezer’s intent, judging by the eponymous article.”
Eliezer replied briefly, to clarify that he wasn't generally thinking of problems that can be directly addressed in local groups (but happen to be politically charged) as "politics":
Hanson’s “Tug the Rope Sideways” principle, combined with the fact that large communities are hard to personally influence, explains a lot in practice about what I find suspicious about someone who claims that conventional national politics are the top priority to discuss. Obviously local community matters are exempt from that critique! I think if I’d substituted ‘national politics as seen on TV’ in a lot of the cases where I said ‘politics’ it would have more precisely conveyed what I was trying to say.
But that doesn't resolve the issue. Even if local politics is more instrumentally tractable, the worry about polarization and factionalization can still apply, and may still make it a poor epistemic training ground.
A subtler problem with banning “political” discussions on a blog or at a meet-up is that it’s hard to do fairly, because our snap judgments about what counts as “political” may themselves be affected by partisan divides. In many cases the status quo is thought of as apolitical, even though objections to the status quo are ‘political.’ (Shades of Pretending to be Wise.)
Because politics gets personal fast, it’s hard to talk about it successfully. But if you’re trying to build a community, build friendships, or build a movement, you can’t outlaw everything ‘personal.’
And selectively outlawing personal stuff gets even messier. Last year, daenerys shared anonymized stories from women, including several that discussed past experiences where the writer had been attacked or made to feel unsafe. If those discussions are made off-limits because they relate to gender and are therefore ‘political,’ some folks may take away the message that they aren’t allowed to talk about, e.g., some harmful or alienating norm they see at meet-ups. I haven’t seen enough discussions of this failure mode to feel super confident people know how to avoid it.
Since this is one of the LessWrong memes that’s most likely to pop up in cross-subcultural dialogues (along with the even more ripe-for-misinterpretation “policy debates should not appear one-sided“…), as a first (very small) step, my action proposal is to obsolete the ‘mind-killer’ framing. A better phrase for getting the same work done would be ‘politics is hard mode’:
1. ‘Politics is hard mode’ emphasizes that ‘mind-killing’ (= epistemic difficulty) is quantitative, not qualitative. Some things might instead fall under Middlingly Hard Mode, or under Nightmare Mode…
2. ‘Hard’ invites the question ‘hard for whom?’, more so than ‘mind-killer’ does. We’re used to the fact that some people and some contexts change what’s ‘hard’, so it’s a little less likely we’ll universally generalize.
3. ‘Mindkill’ connotes contamination, sickness, failure, weakness. In contrast, ‘Hard Mode’ doesn’t imply that a thing is low-status or unworthy. As a result, it’s less likely to create the impression (or reality) that LessWrongers or Effective Altruists dismiss out-of-hand the idea of hypothetical-political-intervention-that-isn’t-a-terrible-idea. Maybe some people do want to argue for the thesis that politics is always useless or icky, but if so it should be done in those terms, explicitly — not snuck in as a connotation.
4. ‘Hard Mode’ can’t readily be perceived as a personal attack. If you accuse someone of being ‘mindkilled’, with no context provided, that smacks of insult — you appear to be calling them stupid, irrational, deluded, or the like. If you tell someone they’re playing on ‘Hard Mode,’ that’s very nearly a compliment, which makes your advice that they change behaviors a lot likelier to go over well.
5. ‘Hard Mode’ doesn’t risk bringing to mind (e.g., gendered) stereotypes about communities of political activists being dumb, irrational, or overemotional.
6. ‘Hard Mode’ encourages a growth mindset. Maybe some topics are too hard to ever be discussed. Even so, ranking topics by difficulty encourages an approach where you try to do better, rather than merely withdrawing. It may be wise to eschew politics, but we should not fear it. (Fear is the mind-killer.)
7. Edit: One of the larger engines of conflict is that people are so much worse at noticing their own faults and biases than noticing others'. People will be relatively quick to dismiss others as 'mindkilled,' while frequently flinching away from or just-not-thinking 'maybe I'm a bit mindkilled about this.' Framing the problem as a challenge rather than as a failing might make it easier to be reflective and even-handed.
This is not an attempt to get more people to talk about politics. I think this is a better framing whether or not you trust others (or yourself) to have productive political conversations.
When I playtested this post, Ciphergoth raised the worry that 'hard mode' isn't scary-sounding enough. As dire warnings go, it's light-hearted—exciting, even. To which I say: good. Counter-intuitive fears should usually be argued into people (e.g., via Eliezer's politics sequence), not connotation-ninja'd or chanted at them. The cognitive content is more clearly conveyed by 'hard mode,' and if some group (people who love politics) stands to gain the most from internalizing this message, the message shouldn't cast that very group (people who love politics) in an obviously unflattering light. LW seems fairly memetically stable, so the main issue is what would make this meme infect friends and acquaintances who haven't read the sequences. (Or Dune.)
If you just want a scary personal mantra to remind yourself of the risks, I propose 'politics is SPIDERS'. Though 'politics is the mind-killer' is fine there too.
If you and your co-conversationalists haven’t yet built up a lot of trust and rapport, or if tempers are already flaring, conveying the message ‘I’m too rational to discuss politics’ or ‘You’re too irrational to discuss politics’ can make things worse. In that context, ‘politics is the mind-killer’ is the mind-killer. At least, it’s a needlessly mind-killing way of warning people about epistemic hazards.
‘Hard Mode’ lets you speak as the Humble Aspirant rather than the Aloof Superior. Strive to convey: ‘I’m worried I’m too low-level to participate in this discussion; could you have it somewhere else?’ Or: ‘Could we talk about something closer to Easy Mode, so we can level up together?’ More generally: If you’re worried that what you talk about will impact group epistemology, you should be even more worried about how you talk about it.
In my opinion, living anywhere other than the center of your industry is a mistake. A lot of people — those who don’t live in that place — don’t want to hear it. But it’s true. Geographic locality is still — even in the age of the Internet — critically important if you want to maximize your access to the best companies, the best people, and the best opportunities. You can always cite exceptions, but that’s what they are: exceptions.
- Marc Andreessen
Like many people in the technology industry, I have been thinking seriously about moving to the Bay Area. However, before I decide to move, I want to do a lot of information gathering. Some basic pieces of information - employment prospects, cost of living statistics, and weather averages - can be found online. But I feel that one's quality of life is determined by a large number of very subtle factors - things like walkability, public transportation, housing quality/dollar of rent, lifestyle options, and so on. These kinds of things seem to require first-hand, in-person examination. For that reason, I'm planning to visit the Bay Area and do an in-depth exploration next month, August 20th-24th.
My guess is that a significant number of LWers are also thinking about moving to the Bay Area, and so I wanted to invite people to accompany me in this exploration. Here are some activities we might do:
- Travel around using public transportation. Which places are convenient to get from/to, and which places aren't?
- Visit the offices of the major tech companies like Google, Facebook, Apple, and Twitter. Ask some of their employees how they feel about being a software engineer in Silicon Valley.
- Eat at local restaurants - not so much the fancy/expensive ones, but the ones a person might go to for a typical, everyday lunch outing.
- See some of the sights. Again, the emphasis would be on the things that would affect our everyday lifestyle, should we decide to move, not so much on the tourist attractions. For example, the Golden Gate Bridge is an awesome structure, but I doubt it would improve my everyday life very much. In contrast, living near a good running trail would be a big boost to my lifestyle.
- Do some apartment viewing, to get a feel for how much rent a good/medium/student apartment costs in different areas and how good the amenities are.
- Go to some local LW meetups, if there are any scheduled for the time window.
- Visit the Stanford and UC Berkeley campuses and the surrounding areas.
- Interact with locals and ask them about their experience living in the region
- Visit a number of different neighborhoods, to try to get a sense of the pros and cons of each
- Discuss how to apply Bayesian decision theory to the problem of finding the optimal place to live ;)
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one.
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.
The following simple game has one solution that seems correct, but isn’t. Can you figure out why?
Player One moves first. He must pick A, B, or C. If Player One picks A the game ends and Player Two does nothing. If Player One picks B or C, Player Two will be told that Player One picked B or C, but will not be told which of these two strategies Player One picked. Player Two must then pick X or Y, and then the game ends. The following shows the Players’ payoffs for each possible outcome. Player One’s payoff is listed first.
A 3,0 [And Player Two never got to move.]
There has been some talk of a lack of content being posted to Less Wrong, so I decided to start a series on various experiments that I've tried and what I've learned from them as I believe that experimentation is key to being a rationalist. My first few posts will be adapted from content I've written for /r/socialskills, but as Less Wrong has a broader scope I plan to post some original content too. I hope that this post will encourage other people to share detailed descriptions of the experiments that they have tried as I believe that this is much more valuable than a list of lessons posted outside of the context in which they were learned. If anyone has already posted any similar posts, then I would really appreciate any links.
I used to have a lot of trouble in conversation thinking of things to say. I wanted to be a more interesting person and I noticed that my brother uses his knowledge of a broad range of topics to engage people in conversations, so I wanted to do the same.
I was drawn quite quickly towards facts because of how quickly they can be read. If a piece of trivia takes 10 seconds to read, then you can read 360 in an hour. If only 5% are good, then that's still 18 usable facts per hour. Articles are longer, but have significantly higher chances of teaching you something. It seemed like you should be able to avoid ever running out of things to talk about with a reasonable investment of time. It didn't quite work out this way, but this was the idea.
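The back-of-the-envelope estimate above can be written out as a short Python sketch. The `recall_fraction` parameter is a hypothetical addition, not something measured in this post; it is there only to show how sensitive the estimate is to how much you later remember.

```python
SECONDS_PER_HOUR = 3600

def usable_facts_per_hour(seconds_per_fact, usable_fraction, recall_fraction=1.0):
    """Facts read per hour, scaled by the share that are good and the share recalled.

    recall_fraction is a hypothetical knob; 1.0 reproduces the original estimate.
    """
    facts_read = SECONDS_PER_HOUR / seconds_per_fact
    return facts_read * usable_fraction * recall_fraction

# The estimate from the text: 10 seconds per fact, 5% of them good.
optimistic = usable_facts_per_hour(seconds_per_fact=10, usable_fraction=0.05)

# Same inputs, but assuming (purely for illustration) that only 1 in 10 is recalled.
pessimistic = usable_facts_per_hour(10, 0.05, recall_fraction=0.1)
```

The first call gives roughly 18 facts per hour, matching the figure above; the second shows how quickly that number shrinks once imperfect recall is factored in.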
Another motivation was that I have always valued intelligence and learning more information made me feel good about myself.
Today I learned: #1 recommended source
The straight dope: Many articles in the archive are quite interesting, but I unsubscribed because I found the more recent ones boring
Cracked: Not the most reliable source and can be a huge time sink, but occasionally there are articles there that will give you 6 or 7 interesting facts in one go
Dr Karl: Science blog
I read through the top 1000 links on Today I learned, the entire archive of the straight dope, maybe half of damn interesting and now I know, half of Karl and all the mythbusters results up to about a year or two ago. We are pretty much talking about months of solid reading.
You probably guessed it, but my return on investment wasn't actually that great. I tended to consume this trivia in ridiculously huge batches because by reading all this information I at least felt like I was doing something. If someone came up to me and asked me for a random piece of trivia - I actually don't have that much that I can pull out. It's actually much easier if someone asks about a specific topic, but there's still not that much I can access.
To test my knowledge I decided to pick the first three topics that came into my head and see how much random trivia I could remember about each. As you can see, the results were rather disappointing:
- Cats can survive falls from a higher number of floors better than from a lower number of floors, because they have a low terminal velocity and more time to orient themselves to ensure they land on their feet
- House cats can run faster than Usain Bolt
- If you are attacked by a dog the best strategy is to shove your hand down its mouth and attack the neck with your other hand
- Dogs can be trained to drive cars (slowly)
- There is such a thing as the world's ugliest dog competition
- Cheese is poisonous to rats
- The existence of rat kings - rats who got their tails stuck together
Knowing these facts does occasionally help me by giving me something interesting to say when I wouldn't have otherwise had it, but quite often I want to quote one of these facts, but I can't quite remember the details. It's hard to quantify how much this helps me though. There have been a few times when I've been able to get someone interested in a conversation that they wouldn't have otherwise been interested in, but I can also go a dozen conversations without quoting any of these facts. No-one has ever gone "Wow, you know so many facts!". Another motivation I had was that being knowledgeable makes me feel good about myself. I don't believe that there was any significant impact in this regard either - I don't have a strong self-concept of myself as someone who is particularly knowledgeable about random facts. Overall this experiment was quite disappointing given the high time investment.
While the social benefits have been extremely minimal, learning all of these facts has expanded my world view.
- I had no idea how crazy nature was: most surprising fact I've learned is that Bluebottles are multiple organisms
- Some of the stuff that the CIA got up to is unbelievable - you'd almost think it came from a conspiracy theorist
- There are many things that you take for granted, but when you think about it, are actually amazing coincidences - moon and sun appearing around the same size
- You don't want to get on the wrong side of the law as it can be horribly unjust
- The government is pretty careless with nuclear weapons. If we can't trust the government to look after nukes, what can we trust them to look after?
While this technique worked poorly for me, there are many changes that I could have made that might have improved effectiveness.
- Lower batch sizes: when you read too many facts in one go you get tired and it all tends to blur together
- Notes: I started making notes of the most interesting facts I was finding using Evernote. I regularly add new facts, but only very occasionally go back and actually look them up. I was trying to review the new facts that I learned regularly, but I got busy and just fell out of the habit. Perhaps I could have a separate list for the most important facts I learn every week and this would be less effort?
- Rereading saved facts: I did a complete reread through my saved notes once. I still don't think that I have a very good recall - probably related to batch size!
- Spaced repetition: Many people claim that this makes memorisation easy
- Thoughtback: This is a lighter alternative to spaced repetition - it gives you notifications on your phone of random facts - about one per day
- Talking to other people: This is a very effective method for remembering facts. The vast majority of facts that I've shared with other people, I still remember. Perhaps I should create a list of facts that I want to remember and then pick one or two at a time to share with people. Once I've shared them a few times, I could move on to the next fact
- Blog posts: perhaps if I collected some of my related facts into blog posts, having to decide which to include and which to leave out may help me remember these facts better
- Pausing: I find that I am more likely to remember things if I pause and think that this is something that I want to remember. I was trying to build that habit, but I didn't succeed in this
- Other memory techniques: brains are better at remembering things if you process them. So if you want to remember the story where thieves stole a whole beach in one night, try to picture the beach and then the shock when some surfer turns up and all the sand is gone. Try to imagine what you'd need to pull that off.
I believe that if I had spread my reading out over a greater period of time, then the cost would have been justified. Part of this would have been improved retention and part of this would have been having a new interesting fact to use in conversation every week that I know I hadn't told anyone else before.
The social benefits are rather minimal, so it would be difficult to get them to match up with the time invested. I believe that with enough refinement, someone could improve their effectiveness to the stage where the benefits matched up with the effort invested, but broadening one's knowledge will always be the primary advantage gained.
As most of you may already know, the plane that recently crashed on disputed Ukrainian soil carried some of the world's top HIV researchers.
One part of me holds vehemently that all human beings are of equal value.
Another part of me wishes there could be extra-creative punishments for depriving the world of its best minds.
I asked this question on Facebook here, and got some interesting answers, but I thought it would be interesting to ask LessWrong and get a larger range of opinions. I've modified the list of options somewhat.
What explains why some classification, prediction, and regression methods are common in academic social science, while others are common in machine learning and data science?
For instance, I've encountered probit models in some academic social science, but not in machine learning.
The main algorithms that I believe are common to academic social science and machine learning are the most standard regression algorithms: linear regression and logistic regression.
Possibilities that come to mind:
(0) My observation is wrong and/or the whole question is misguided.
(1) The focus in machine learning is on algorithms that can perform well on large data sets. Thus, for instance, probit models may be academically useful but don't scale up as well as logistic regression.
(2) Academic social scientists take time to catch up with new machine learning approaches. Of the methods mentioned above, random forests and support vector machines were introduced as recently as 1995. Neural networks are older but their practical implementation is about as recent. Moreover, the practical implementation of these algorithms in the standard statistical software and packages that academics rely on is even more recent. (This relates to point (4).)
(3) Academic social scientists are focused on publishing papers, where the goal is generally to determine whether a hypothesis is true. Therefore, they rely on approaches that have clear rules for hypothesis testing and for establishing statistical significance (see also this post of mine). Many of the new machine learning approaches don't have clearly defined statistical approaches for significance testing. Also, the strength of machine learning approaches is more exploratory than testing already formulated hypotheses (this relates to point (5)).
(4) Some of the new methods are complicated to code, and academic social scientists don't know enough mathematics, computer science, or statistics to cope with the methods (this may change if they're taught more about these methods in graduate school, but the relative newness of the methods is a factor here, relating to (2)).
(5) It's hard to interpret the results of fancy machine learning tools in a manner that yields social scientific insight. The results of a linear or logistic regression can be interpreted somewhat intuitively: the parameters (coefficients) associated with individual features describe the extent to which those features affect the output variable. Modulo issues of feature scaling, larger coefficients mean those features play a bigger role in determining the output. Pairwise and listwise R^2 values provide additional insight on how much signal and noise there is in individual features. But if you're looking at a neural network, it's quite hard to infer human-understandable rules from that. (The opposite direction is not too hard: it is possible to convert human-understandable rules to a decision tree and then to use a neural network to approximate that, and add appropriate fuzziness. But the neural networks we obtain as a result of machine learning optimization may be quite different from those that we can interpret as humans). To my knowledge, there haven't been attempts to reinterpret neural network results in human-understandable terms, though Sebastian Kwiatkowski's comment on my Facebook post points to an example where the results of naive Bayes and SVM classifiers for hotel reviews could be translated into human-understandable terms (namely, reviews that mentioned physical aspects of the hotel, such as "small bedroom", were more likely to be truthful than reviews that talked about the reasons for the visit or the company that sponsored the visit). But Kwiatkowski's comment also pointed to other instances where the machine's algorithms weren't human-interpretable.
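To make the probit/logistic comparison in (1) and the coefficient interpretation in (5) a bit more concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not an analysis from the Facebook thread: the function names are hypothetical, and the rescaling constant 1.702 is a commonly quoted approximation that I am assuming here rather than deriving. The sketch compares the two link functions on a grid and shows how a logistic regression coefficient translates into an odds ratio.

```python
import math


def logistic_cdf(z: float) -> float:
    """Link function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_cdf(z: float) -> float:
    """Standard normal CDF, the link function used by probit models."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_link_gap(scale: float = 1.702, lo: float = -8.0,
                 hi: float = 8.0, steps: int = 1601) -> float:
    """Largest absolute gap on a grid between the logistic curve and a
    rescaled probit curve. The default scale is an assumed constant."""
    gap = 0.0
    for i in range(steps):
        z = lo + (hi - lo) * i / (steps - 1)
        gap = max(gap, abs(logistic_cdf(z) - probit_cdf(z / scale)))
    return gap


def odds_ratio(coefficient: float) -> float:
    """In logistic regression, a one-unit increase in a feature multiplies
    the odds of the positive outcome by exp(coefficient)."""
    return math.exp(coefficient)


print(f"largest gap between the two links: {max_link_gap():.4f}")
print(f"odds ratio for a coefficient of 0.7: {odds_ratio(0.7):.2f}")
```

If the gap is as small as this suggests, the two models should usually fit binary data about equally well, which would make convention and tooling (rather than predictive accuracy) a plausible part of the explanation for which field uses which.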
What's your personal view on my main question, and on any related issues?
In early 2000, I registered my personal domain name weidai.com, along with a couple others, because I was worried that the small (sole-proprietor) ISP I was using would go out of business one day and break all the links on the web to the articles and software that I had published on my "home page" under its domain. Several years ago I started getting offers, asking me to sell the domain, and now they're coming in almost every day. A couple of days ago I saw the first six figure offer ($100,000).
In early 2009, someone named Satoshi Nakamoto emailed me personally with an announcement that he had published version 0.1 of Bitcoin. I didn't pay much attention at the time (I was more interested in Less Wrong than Cypherpunks at that point), but then in early 2011 I saw a LW article about Bitcoin, which prompted me to start mining it. I wrote at the time, "thanks to the discussion you started, I bought a Radeon 5870 and started mining myself, since it looks likely that I can at least break even on the cost of the card." That approximately $200 investment (plus maybe another $100 in electricity) is also worth around six figures today.
Clearly, technological advances can sometimes create gold rush-like situations (i.e., first-come-first-serve opportunities to make truly extraordinary returns with minimal effort or qualifications). And it's possible to stumble into them without even trying. Which makes me think, maybe we should be trying? I mean, if only I had been looking for possible gold rushes, I could have registered a hundred domain names optimized for potential future value, rather than the few that I happened to personally need. Or I could have started mining Bitcoins a couple of years earlier and be a thousand times richer.
I wish I was already an experienced gold rush spotter, so I could explain how best to do it, but as indicated above, I participated in the ones that I did more or less by luck. Perhaps the first step is just to keep one's eyes open, and to keep in mind that tech-related gold rushes do happen from time to time and they are not impossibly difficult to find. What other ideas do people have? Are there other past examples of tech gold rushes besides the two that I mentioned? What might be some promising fields to look for them in the future?
Granted, writing is not very effective. But some of us just love writing...
Earning to Give Writing: Which are the places that pay 1 USD or more per word?
Clarification Writing: What needs being written because it is only through writing that these ideas will emerge in the first place?
What should we be writing about if we have already been, for very long, training the craft? What has not yet been written, what is the new thing?
I recently realized that, encouraged by LessWrong, I had been using a heuristic in my philosophical reasoning that I now think is suspect. I'm not accusing anybody else of falling into the same trap; I'm just recounting my own situation for the benefit of all.
I actually am not 100% sure that the heuristic is wrong. I hope that this discussion about it generalizes into a conversation about intuition and the relationship between FAI epistemology and our own epistemology.
The heuristic is this: If the ideal FAI would think a certain way, then I should think that way as well. At least in epistemic matters, I should strive to be like an ideal FAI.
Examples of the heuristic in use are:
--The ideal FAI wouldn't care about its personal identity over time; it would have no problem copying itself and deleting the original as the need arose. So I should (a) not care about personal identity over time, even if it exists, and (b) stop believing that it exists.
--The ideal FAI wouldn't care about its personal identity at a given time either; if it was proven that 99% of all observers with its total information set were in fact Boltzmann Brains, then it would continue to act as if it were not a Boltzmann Brain, since that's what maximizes utility. So I should (a) act as if I'm not a BB even if I am one, and (b) stop thinking it is even a meaningful possibility.
--The ideal FAI would think that the specific architecture it is implemented on (brains, computers, nanomachines, giant look-up tables) is irrelevant except for practical reasons like resource efficiency. So, following its example, I should stop worrying about whether e.g. a simulated brain would be conscious.
--The ideal FAI would think that it was NOT a "unified subject of experience" or an "irreducible substance" or that it was experiencing "ineffable, irreducible quale," because believing in those things would only distract it from understanding and improving its inner workings. Therefore, I should think that I, too, am nothing but a physical mechanism and/or an algorithm implemented somewhere but capable of being implemented elsewhere.
--The ideal FAI would use UDT/TDT/etc. Therefore I should too.
--The ideal FAI would ignore uncomputable possibilities. Therefore I should too.
Arguably, most if not all of the conclusions I drew in the above are actually correct. However, I think that the heuristic is questionable, for the following reasons:
(1) Sometimes what we think of as the ideal FAI isn't actually ideal. Case in point: The final bullet above about uncomputable possibilities. We intuitively think that uncomputable possibilities ought to be countenanced, so rather than overriding our intuition when presented with an attractive theory of the ideal FAI (in this case AIXI) perhaps we should keep looking for an ideal that better matches our intuitions.
(2) The FAI is a tool for serving our wishes; if we start to think of ourselves as being fundamentally the same sort of thing as the FAI, our values may end up drifting badly. For simplicity, let's suppose the FAI is designed to maximize happy human life-years. The problem is, we don't know how to define a human. Do simulated brains count? What about patterns found inside rocks? What about souls, if they exist? Suppose we have the intuition that humans are indivisible entities that persist across time. If we reason using the heuristic I am talking about, we would decide that, since the FAI doesn't think it is an indivisible entity that persists across time, we shouldn't think we are either. So we would then proceed to tell the FAI "Humans are naught but a certain kind of functional structure," and (if our overruled intuition was correct) all get killed.
Note 1: "Intuitions" can (I suspect) be thought of as another word for "Priors."
Note 2: We humans are NOT Solomonoff-induction approximators, as far as I can tell. This bodes ill for FAI, I think.
- Australia - Online Hangout: 13 July 2014 06:30PM
- Frankfurt: Goal Factoring: 20 July 2014 02:00PM
- Houston, TX: 12 July 2014 02:00PM
- [Portland] Calibration Training and Potluck - Portland: 12 July 2014 06:31PM
- Upper Canada LW Megameetup: Ottawa, Toronto, Montreal, Waterloo, London: 18 July 2014 07:00PM
The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:
- Brussels - July meetup: 12 July 2014 01:00PM
- Brussels - August (topic TBD): 09 August 2014 01:00PM
- Canberra: Paranoid Debating: 12 July 2014 06:00PM
- London social meetup - possibly in a park: 13 July 2014 02:00PM
- Sydney Meetup - July: 23 July 2014 07:00PM
- Washington, D.C.: Prisoner's Dilemma tournament: 13 July 2014 03:00PM
Locations with regularly scheduled meetups: Austin, Berkeley, Berlin, Boston, Brussels, Buffalo, Cambridge UK, Canberra, Columbus, London, Madison WI, Melbourne, Mountain View, New York, Philadelphia, Research Triangle NC, Salt Lake City, Seattle, Sydney, Toronto, Vienna, Washington DC, Waterloo, and West Los Angeles. There's also a 24/7 online study hall for coworking LWers.
WARNING: Memetic hazard.
Is there anything we should do?
Analogy gets a bad rap around here, and not without reason. The kinds of argument from analogy condemned in the above links fully deserve the condemnation they get. Still, I think it's too easy to read them and walk away thinking "Boo analogy!" when not all uses of analogy are bad. The human brain seems to have hardware support for thinking in analogies, and I don't think this capability is a waste of resources, even in our highly non-ancestral environment. So, assuming that the linked posts do a sufficient job detailing the abuse and misuse of analogy, I'm going to go over some legitimate uses.
The first thing analogy is really good for is description. Take the plum pudding atomic model. I still remember this falsified proposal of negative 'raisins' in positive 'dough' largely because of the analogy, and I don't think anyone ever attempted to use it to argue for the existence of tiny subnuclear particles corresponding to cinnamon.
But this is only a modest example of what analogy can do. The following is an example that I think starts to show the true power: my comment on Robin Hanson's 'Don't Be "Rationalist"'. To summarize, Robin argued that since you can't be rationalist about everything you should budget your rationality and only be rational about the most important things; I replied that maybe rationality is like weightlifting, where your strength is finite yet it increases with use. That comment is probably the most successful thing I've ever written on the rationalist internet in terms of the attention it received, including direct praise from Eliezer and a shoutout in a Scott Alexander (yvain) post, and it's pretty much just an analogy.
Here's another example, this time from Eliezer. As part of the AI-Foom debate, he tells the story of Fermi's nuclear experiments, and in particular his precise knowledge of when a pile would go supercritical.
What do the above analogies accomplish? They provide counterexamples to universal claims. In my case, Robin's inference that rationality should be spent sparingly proceeded from the stated premise that no one is perfectly rational about anything, and weightlifting was a counterexample to the implicit claim 'a finite capacity should always be directed solely towards important goals'. If you look above my comment, anon had already said that the conclusion hadn't been proven, but without the counterexample this claim had much less impact.
In Eliezer's case, "you can never predict an unprecedented unbounded growth" is the kind of claim that sounds really convincing. "You haven't actually proved that" is a weak-sounding retort; "Fermi did it" immediately wins the point.
The final thing analogies do really well is crystallize patterns. For an example of this, let's turn to... Failure by Analogy. Yep, the anti-analogy posts are themselves written almost entirely via analogy! Alchemists who glaze lead with lemons and would-be aviators who put beaks on their machines are invoked to crystallize the pattern of 'reasoning by similarity'. The post then makes the case that neural-net worshippers are reasoning by similarity in just the same way, making the same fundamental error.
It's this capacity that makes analogies so dangerous. Crystallizing a pattern can be so mentally satisfying that you don't stop to question whether the pattern applies. The antidote to this is the question, "Why do you believe X is like Y?" Assessing the answer and judging deep similarities from superficial ones may not always be easy, but just by asking you'll catch the cases where there is no justification at all.
In experiments performed on mice, blood transfusions from young mice reversed age-related markers in older mice. The protein involved is identical in humans.
This is the public group instrumental rationality diary for July 16-31.
It's a place to record and chat about it if you have done, or are actively doing, things like:
- Established a useful new habit
- Obtained new evidence that made you change your mind about some belief
- Decided to behave in a different way in some set of situations
- Optimized some part of a common routine or cached behavior
- Consciously changed your emotions or affect with respect to something
- Consciously pursued new valuable information about something that could make a big difference in your life
- Learned something new about your beliefs, behavior, or life that surprised you
- Tried doing any of the above and failed
Or anything else interesting which you want to share, so that other people can think about it, and perhaps be inspired to take action themselves. Try to include enough details so that everyone can use each other's experiences to learn about what tends to work out, and what doesn't tend to work out.
Thanks to cata for starting the Group Rationality Diary posts, and to commenters for participating.
Previous diary: July 1-15
Here is an interesting blog post about a guy who did a resume experiment between two positions which he argues are by experience identical, but occupy different "social status" positions in tech: A software engineer and a data manager.
Interview A: as Software Engineer
Bill faced five hour-long technical interviews. Three went well. One was so-so, because it focused on implementation details of the JVM, and Bill’s experience was almost entirely in C++, with a bit of hobbyist OCaml. The last interview sounds pretty hellish. It was with the VP of Data Science, Bill’s prospective boss, who showed up 20 minutes late and presented him with one of those interview questions where there’s “one right answer” that took months, if not years, of in-house trial and error to discover. It was one of those “I’m going to prove that I’m smarter than you” interviews...
Let’s recap this. Bill passed three of his five interviews with flying colors. One of the interviewers, a few months later, tried to recruit Bill to his own startup. The fourth interview was so-so, because he wasn’t a Java expert, but came out neutral. The fifth, he failed because he didn’t know the in-house Golden Algorithm that took years of work to discover. When I asked that VP/Data Science directly why he didn’t hire Bill (and he did not know that I knew Bill, nor about this experiment) the response I got was “We need people who can hit the ground running.” Apparently, there’s only a “talent shortage” when startup people are trying to scam the government into changing immigration policy. The undertone of this is that “we don’t invest in people”.
Or, for a point that I’ll come back to, software engineers lack the social status necessary to make others invest in them.
Interview B: as Data Science manager.
A couple weeks later, Bill interviewed at a roughly equivalent company for the VP-level position, reporting directly to the CTO.
Worth noting is that we did nothing to make Bill more technically impressive than for Company A. If anything, we made his technical story more honest, by modestly inflating his social status while telling a “straight shooter” story for his technical experience. We didn’t have to cover up periods of low technical activity; that he was a manager, alone, sufficed to explain those away.
Bill faced four interviews, and while the questions were behavioral and would be “hard” for many technical people, he found them rather easy to answer with composure. I gave him the Golden Answer, which is to revert to “There’s always a trade-off between wanting to do the work yourself, and knowing when to delegate.” It presents one as having managerial social status (the ability to delegate) but also a diligent interest in, and respect for, the work. It can be adapted to pretty much any “behavioral” interview question...
Bill passed. Unlike for a typical engineering position, there were no reference checks. The CEO said, “We know you’re a good guy, and we want to move fast on you”. As opposed to the 7-day exploding offers typically served to engineers, Bill had 2 months in which to make his decision. He got a fourth week of vacation without even having to ask for it, and genuine equity (about 75% of a year’s salary vesting each year)...
It was really interesting, as I listened in, to see how different things are once you’re “in the club”. The CEO talked to Bill as an equal, not as a paternalistic, bullshitting, “this is good for your career” authority figure. There was a tone of equality that a software engineer would never get from the CEO of a 100-person tech company.
The author concludes that positions labeled as code-monkey-like are low status, while positions labeled as managerial are high status, even if they are "essentially" doing the same sort of work.
Not sure about this methodology, but it's food for thought.
I have high confidence that economically-valuable self-replicating robots are possible with existing technology: initially, something similar in size and complexity to a RepRap, but able to assemble a copy of itself from parts ordered online with zero human interaction. This is important because more robots could provide the economic growth needed to solve many urgent problems. I've held this idea for long enough that I'm worried about being a crank, so any feedback is appreciated.
I care because to fulfill my naive and unrealistic dreams (not dying, owning a spaceship) I need the world to be a LOT richer. Specifically, naively assuming linear returns to medical research funding, a funding increase of ~10x (to ~$5 trillion/year, or ~30% of current USA GDP) is needed to achieve actuarial escape velocity (average lifespans currently increase by about 1 year each decade, so a 10x increase is needed for science to keep up with aging). The simplest way to get there is to have 10x as many machines per person.
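The arithmetic behind the ~10x figure can be checked with a short Python sketch. It takes the numbers stated above at face value (about 1 year of lifespan gained per decade, a ~$5 trillion/year target, ~30% of GDP) and keeps the same naive assumption of linear returns to funding. The variable and function names are made up for illustration, and the "implied" values are simply back-calculated from those round numbers, not independent data.

```python
# Round figures taken from the paragraph above; everything else is derived.
LIFESPAN_GAIN_PER_YEAR = 1 / 10   # about 1 year of lifespan gained per decade
TARGET_GAIN_PER_YEAR = 1.0        # escape velocity: 1 year gained per calendar year
TARGET_FUNDING_TRILLIONS = 5.0    # stated target, in trillions of dollars per year
TARGET_SHARE_OF_GDP = 0.30        # stated share of current USA GDP


def required_multiplier(current_gain: float, target_gain: float) -> float:
    """Under linear returns, funding must grow by the same factor as the gain."""
    return target_gain / current_gain


multiplier = required_multiplier(LIFESPAN_GAIN_PER_YEAR, TARGET_GAIN_PER_YEAR)

# Back out what the stated target implies about today's funding and GDP.
implied_current_funding = TARGET_FUNDING_TRILLIONS / multiplier
implied_gdp = TARGET_FUNDING_TRILLIONS / TARGET_SHARE_OF_GDP

print(f"required funding multiplier: {multiplier:.0f}x")
print(f"implied current funding: ${implied_current_funding:.1f} trillion/year")
print(f"implied GDP: ${implied_gdp:.1f} trillion/year")
```

The whole estimate hinges on the linear-returns assumption: if returns to research funding diminish, the required multiplier would be larger than the one computed here.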
My vision is that someone does for hardware what open-source has done for software: make useful tools free. A key advantage of software is that making a build or copying a program takes only one step. In software, you click "compile" and (hopefully) it's done and ready to test in seconds. In hardware, it takes a bunch of steps to build a prototype (order parts, screw fiddly bits together, solder, etc.). A week is an insanely short lead time for building a new prototype of something mechanical. 1-2 months is typical in many industries. This means that mechanical things have high marginal cost, because people have to build and debug them, and typically transport them for thousands of miles from factory to consumer.
Relevant previous research projects include trivial self-replication from pre-fabricated components and an overly-ambitious NASA-funded plan from the 1980s to develop the Moon using self-replicating robots. Current research funding tends to go toward bio-inspired systems, re-configurable systems using prefabricated cubes (conventionally-manufactured), or chemistry deceptively called "nanotech", all of which seem to miss the opportunity to use existing autonomous assembly technology with online ordering of parts to make things cheaper by getting rid of setup cost and building cost.
I envision a library/repository of useful robots for specific tasks (cleaning, manufacturing, etc.), in a standard format for download (parts list, 3D models, assembly instructions, etc.). Parts could be ordered online. A standard fabricator robot with the capability to identify and manipulate parts, and fasten them using screws, would verify that the correct parts were received, put everything together, and run performance checks. For comparison, the RepRap takes >9 hours of careful human labor to build. An initial self-replicating implementation would be a single fastener robot. It would spread by undercutting the price of competing robot arm systems. Existing systems sell for ~2x the cost of components, due to overhead for engineering, assembly, and shipping. This appears true for robots at a range of price points, including $200 robot arms using hobby servos and $40,000+ robot arms using optical encoders and direct-drive brushless motors. A successful system that undercut the price of conventionally-assembled hobby robots would provide a platform for hobbyists to create additional robots that could be autonomously built (e.g. a Roomba for 1/5 the price, due to not needing to pay the 5x markup for overhead and distribution). Once a beachhead is established in the form of a successful self-replicating assembly robot, market pressures would drive full automation of more products/industries, increasing output for everyone.
This is a very hard programming challenge, but the tools exist to identify, manipulate and assemble parts. Specifically, ROS is an open-source software library whose packages can be put together to solve tasks such as mapping a building or folding laundry. It's hard because it would require a lot of steps and a new combination of existing tools.
This is also a hard systems/mechanical challenge: delivering enough data and control bandwidth for observability and controllability, and providing lightweight and rigid hardware, so that the task for the software is possible rather than impossible. Low-cost components have less performance: a webcam has limited resolution, and hobby servos have limited accuracy. The key problem - autonomously picking up a screw and screwing it into a hole - was solved years ago for assembly-line robots. Doing the same task with low-cost components appears possible in principle. A comparable problem that has been solved is autonomous construction using quadcopters.
Personally, I would like to build a robot arm that could assemble more robot arms. It would require, at minimum, a robot arm using hobby servos, a few webcams, custom grippers (for grasping screws, servos, and laser-cut sheet parts), custom fixtures (blocks with a cutout to hold two parts in place while the robot arm inserts a screw; ideally multiple robot arms would be used to minimize unique tooling but fixtures would be easier initially), and a lot of challenging code using ROS and Gazebo. Just the mechanical stuff, which I have the education for, would be a challenging months-long side project, and the software stuff could take years of study (the equivalent of a CS degree) before I'd have the required background to reasonably attempt it.
I'm not sure what to do with this idea. Getting a CS degree on top of a mechanical engineering degree (so I could know enough to build this) seems like a good career choice for interesting work and high pay (even if/when this doesn't work). Previous ideas like this I've had that are mostly outside my field have been unfeasible for reasons only someone familiar with the field would know. It's challenging to stay motivated to work on this, because the payoff is so distant, but it's also challenging not to work on this, because there's enough of a chance that this would work that I'm excited about it. I'm posting this here in the hopes someone with experience with industrial automation will be inspired to build this, and to get well-reasoned feedback.
This post explores the question: how strongly should we defer to predictions and forecasts made by people with domain expertise? I'll assume that the domain expertise is legitimate, i.e., the people with domain expertise do have a lot of information in their minds that non-experts don't. The information is usually not secret, and non-experts can usually access it through books, journals, and the Internet. But experts have more information inside their heads, and may understand it better. How big an advantage does this give them in forecasting?
Tetlock and expert political judgment
In an earlier post on historical evaluations of forecasting, I discussed Philip E. Tetlock's findings on expert political judgment and forecasting skill, and summarized his own article for Cato Unbound co-authored with Dan Gardner that in turn summarized the themes of the book:
- The average expert’s forecasts were revealed to be only slightly more accurate than random guessing—or, to put more harshly, only a bit better than the proverbial dart-throwing chimpanzee. And the average expert performed slightly worse than a still more mindless competition: simple extrapolation algorithms that automatically predicted more of the same.
- The experts could be divided roughly into two overlapping yet statistically distinguishable groups. One group (the hedgehogs) would actually have been beaten rather soundly even by the chimp, not to mention the more formidable extrapolation algorithm. The other (the foxes) would have beaten the chimp and sometimes even the extrapolation algorithm, although not by a wide margin.
- The hedgehogs tended to use one analytical tool in many different domains; they preferred keeping their analysis simple and elegant by minimizing “distractions.” These experts zeroed in on only essential information, and they were unusually confident—they were far more likely to say something is “certain” or “impossible.” In explaining their forecasts, they often built up a lot of intellectual momentum in favor of their preferred conclusions. For instance, they were more likely to say “moreover” than “however.”
- The foxes used a wide assortment of analytical tools, sought out information from diverse sources, were comfortable with complexity and uncertainty, and were much less sure of themselves—they tended to talk in terms of possibilities and probabilities and were often happy to say “maybe.” In explaining their forecasts, they frequently shifted intellectual gears, sprinkling their speech with transition markers such as “although,” “but,” and “however.”
- It's unclear whether the performance of the best forecasters is the best that is in principle possible.
- This widespread lack of curiosity—lack of interest in thinking about how we think about possible futures—is a phenomenon worthy of investigation in its own right.
Tetlock has since started The Good Judgment Project (website, Wikipedia), a political forecasting competition in which anybody can participate, and which has a reputation for doing a much better job at prediction than anything else around. Participants are given a set of questions and can basically collect freely available online information (in some rounds, participants were given additional access to some proprietary data). They then use that to make predictions. The aggregate predictions are quite good. For more information, visit the website or see the references in the Wikipedia article. In particular, this Economist article and this Business Insider article are worth reading. (I discussed the GJP and other approaches to global political forecasting in this post).
So at least in the case of politics, it seems that amateurs, armed with basic information plus the freedom to look around for more, can use "fox-like" approaches and do a better job of forecasting than political scientists. Note that experts still do better than ignorant non-experts who are denied access to information. But once you have basic knowledge and are equipped to hunt more down, the constraining factor does not seem to be expertise, but rather, the approach you use (fox-like versus hedgehog-like). This should not be taken as a claim that expertise is irrelevant or unnecessary to forecasting. Experts play an important role in expanding the scope of knowledge and methodology that people can draw on to make their predictions. But the experts themselves, as people, do not have a unique advantage when it comes to forecasting.
Tetlock's research focused on politics. But the claim that the fox-hedgehog distinction turns out to be a better predictor of forecasting performance than the level of expertise is a general one. How true is this claim in domains other than politics? Domains such as climate science, economic growth, computing technology, or the arrival of artificial general intelligence?
Armstrong and Green again
J. Scott Armstrong is a leading figure in the forecasting community. Along with Kesten C. Green, he penned a critique of the forecasting exercises in climate science in 2007, with special focus on the IPCC reports. I discussed the critique at length in my post on the insularity critique of climate science. Here, I quote a part from the introduction of the critique that better explains the general prior that Armstrong and Green claim to be bringing to the table when they begin their evaluation. Of the points they make at the beginning, two bear directly on the deference we should give to expert judgment and expert consensus:
- Unaided judgmental forecasts by experts have no value: This applies whether the opinions are expressed in words, spreadsheets, or mathematical models. It applies regardless of how much scientific evidence is possessed by the experts. Among the reasons for this are:
a) Complexity: People cannot assess complex relationships through unaided observations.
b) Coincidence: People confuse correlation with causation.
c) Feedback: People making judgmental predictions typically do not receive unambiguous feedback they can use to improve their forecasting.
d) Bias: People have difficulty in obtaining or using evidence that contradicts their initial beliefs. This problem is especially serious for people who view themselves as experts.
- Agreement among experts is only weakly related to accuracy: This is especially true when the experts communicate with one another and when they work together to solve problems, as is the case with the IPCC process.
Armstrong and Green later elaborate on these claims, referencing Tetlock's work. (Note that I have removed the parts of the section that involve direct discussion of climate-related forecasts, since the focus here is on the general question of how much deference to show to expert consensus).
Many public policy decisions are based on forecasts by experts. Research on persuasion has shown that people have substantial faith in the value of such forecasts. Faith increases when experts agree with one another. Our concern here is with what we refer to as unaided expert judgments. In such cases, experts may have access to empirical studies and other information, but they use their knowledge to make predictions without the aid of well-established forecasting principles. Thus, they could simply use the information to come up with judgmental forecasts. Alternatively, they could translate their beliefs into mathematical statements (or models) and use those to make forecasts.
Although they may seem convincing at the time, expert forecasts can make for humorous reading in retrospect. Cerf and Navasky’s (1998) book contains 310 pages of examples, such as Fermi Award-winning scientist John von Neumann’s 1956 prediction that “A few decades hence, energy may be free”. [...] The second author’s review of empirical research on this problem led him to develop the “Seer-sucker theory,” which can be stated as “No matter how much evidence exists that seers do not exist, seers will find suckers” (Armstrong 1980). The amount of expertise does not matter beyond a basic minimum level. There are exceptions to the Seer-sucker Theory: When experts get substantial well-summarized feedback about the accuracy of their forecasts and about the reasons why their forecasts were or were not accurate, they can improve their forecasting. This situation applies for short-term (up to five day) weather forecasts, but we are not aware of any such regime for long-term global climate forecasting. Even if there were such a regime, the feedback would trickle in over many years before it became useful for improving forecasting.
Research since 1980 has provided much more evidence that expert forecasts are of no value. In particular, Tetlock (2005) recruited 284 people whose professions included, “commenting or offering advice on political and economic trends.” He asked them to forecast the probability that various situations would or would not occur, picking areas (geographic and substantive) within and outside their areas of expertise. By 2003, he had accumulated over 82,000 forecasts. The experts barely if at all outperformed non-experts and neither group did well against simple rules. Comparative empirical studies have routinely concluded that judgmental forecasting by experts is the least accurate of the methods available to make forecasts. For example, Ascher (1978, p. 200), in his analysis of long-term forecasts of electricity consumption found that was the case.
Note that the claims that Armstrong and Green make are in relation to unaided expert judgment, i.e., expert judgment that is not aided by some form of assistance or feedback that promotes improved forecasting. (One can argue that expert judgment in climate science is not unaided, i.e., that the critique is mis-applied to climate science, but whether that is the case is not the focus of my post). While Tetlock's suggestion is to be more fox-like, Armstrong and Green recommend the use of their own forecasting principles, as encoded in their full list of principles and described on their website.
A conflict of intuitions, and an attempt to resolve it
I have two conflicting intuitions here. I like to use the majority view among experts as a reasonable Bayesian prior to start with, that I might then modify based on further study. The relevant question here is who the experts are. Do I defer to the views of domain experts, who may know little about the challenges of forecasting, or do I defer to the views of forecasting experts, who may know little of the domain but argue that domain experts who are not following good forecasting principles do not have any advantage over non-experts?
I think the following heuristics are reasonable starting points:
- In cases where we have a historical track record of forecasts, we can use that to evaluate the experts and non-experts. For instance, I reviewed the track record of survey-based macroeconomic forecasts, thanks to a wealth of recorded data on macroeconomic forecasts by economists over the last few decades. (Unfortunately, these surveys did not include corresponding data on layperson opinion).
- The faster the feedback from making a forecast to knowing whether it's right, the more likely it is that experts would have learned how to make good forecasts.
- The more central forecasting is to the overall goals of the domain, the more likely people are to get it right. For instance, forecasting is a key part of weather and climate science. But forecasting progress on mathematical problems has a negligible relation with doing mathematical research.
- Ceteris paribus, if experts are clearly recording their forecasts and the reasons behind them, and systematically evaluating the performance on past forecasts, that should be taken as (weak) evidence in favor of the experts' views being taken more seriously (even if we don't have enough of a historical track record to properly calibrate forecast accuracy). However, if they simply make forecasts but then fail to review their past history of forecasts, this may be taken as being about as bad as not forecasting at all. And in cases that the forecasts were bold, failed miserably, and yet the errors were not acknowledged, this should be taken as being considerably worse than not forecasting at all.
- A weak inside view of the nature of domain expertise can give some idea of whether expertise should generally translate to better forecasting skill. For instance, even a very weak understanding of physics will tell us that physicists are no better than anyone else at predicting whether a coin toss will yield heads or tails, even though the fate of the coin is determined by physics. Similarly, with the exception of economists who specialize in the study of macroeconomic indicators, one wouldn't expect economists to be able to forecast macroeconomic indicators better than most moderately economically informed people.
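To make the first heuristic (evaluating a historical track record) concrete, here is a minimal sketch of how one might score recorded probability forecasts. It uses the Brier score, which is my choice for illustration rather than something the sources above prescribe. The outcomes and the forecasts for the "expert" and "layperson" are made-up numbers, and the constant 0.5 forecast is only a stand-in for an uninformed guesser.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.

    Lower is better; always answering 0.5 scores exactly 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("need one forecast per outcome")
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))


# Hypothetical track record: 1 means the event happened, 0 means it did not.
outcomes = [1, 0, 0, 1, 0]
expert = [0.9, 0.8, 0.1, 0.6, 0.3]      # confident, with one bad miss
layperson = [0.6, 0.5, 0.4, 0.5, 0.5]   # hedged, rarely far from 0.5
uninformed = [0.5] * len(outcomes)      # stand-in for a know-nothing baseline

for name, forecasts in [("expert", expert),
                        ("layperson", layperson),
                        ("uninformed", uninformed)]:
    print(f"{name:10s} Brier score = {brier_score(forecasts, outcomes):.3f}")
```

With only five questions the differences are not meaningful; the point is just that a recorded track record lets you compare groups against each other and against a trivial baseline.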
My first thought was that the more politicized a field, the less reliable any forecasts coming out of it. I think there are obvious reasons for that view, but there are also countervailing considerations.
The main claimed danger of politicization is groupthink and lack of openness to evidence. It could even lead to suppression, misrepresentation, or fabrication of evidence. Quite often, however, we see these qualities in highly non-political fields. People believe that certain answers are the right ones. Their political identity or ego is not attached to it. They just have high confidence that that answer is correct, and when the evidence they have does not match up, they think there is a problem with the evidence. Of course, if somebody does start challenging the mainstream view, and the issue is not quickly resolved either way, it can become politicized, with competing camps of people who hold the mainstream view and people who side with the challengers. Note, however, that the politicization has arguably reduced the aggregate amount of groupthink in the field. Now that there are two competing camps rather than one received wisdom, new people can examine evidence and better decide which camp is more on the side of truth. People in both camps, now that they are competing, may try to offer better evidence that could convince the undecideds or skeptics. So "politicization" might well improve the epistemic situation (I don't doubt that the opposite happens quite often). Examples of such politicization might be the replacement of geocentrism by heliocentrism, the replacement of creationism by evolution, and the replacement of Newtonian mechanics by relativity and/or quantum mechanics. In the first two cases, religious authorities pushed against the new idea, even though the old idea had not been a "politicized" tenet before the competing claims came along. In the case of Newtonian and quantum mechanics, the debate seems to have been largely intra-science, but quantum mechanics had its detractors, including Einstein, famous for the "God does not play dice" quip. (This post on Slate Star Codex is somewhat related).
The above considerations aren't specific to forecasting, and they apply even for assertions that fall squarely within the domain of expertise and require no forecasting skill per se. The extent to which they apply to forecasting problems is unclear. It's unclear whether most domains have any significant groupthink in favor of particular forecasts. In fact, in most domains, forecasts aren't really made or publicly recorded at all. So concerns of groupthink in a non-politicized scenario may not apply to forecasting. Perhaps the problem is the opposite: forecasts are so unimportant in many domains that the forecasts offered by experts are almost completely random and hardly informed in a systematic way by their expert knowledge. Even in such situations, politicization can be helpful, in so far as it makes the issue more salient and might prompt individuals to give more attention to trying to figure out which side is right.
The case of forecasting AI progress
I'm still looking at the case of forecasting AI progress, but for now, I'd like to point people to Luke Muehlhauser's excellent blog post from May 2013 discussing the difficulty with forecasting AI progress. Interestingly, he makes many points similar to those I make here. (Note: Although I had read the post around the time it was published, I hadn't read it recently until I finished drafting the rest of my current post. Nonetheless, my views can't be considered totally independent of Luke's because we've discussed my forecasting contract work for MIRI).
Should we expect experts to be good at predicting AI, anyway? As Armstrong & Sotala (2012) point out, decades of research on expert performance suggest that predicting the first creation of AI is precisely the kind of task on which we should expect experts to show poor performance, e.g. because feedback is unavailable and the input stimuli are dynamic rather than static. Muehlhauser & Salamon (2013) add, “If you have a gut feeling about when AI will be created, it is probably wrong.”
On the other hand, Tetlock (2005) points out that, at least in his large longitudinal database of pundit’s predictions about politics, simple trend extrapolation is tough to beat. Consider one example from the field of AI: when David Levy asked 1989 World Computer Chess Championship participants when a chess program would defeat the human World Champion, their estimates tended to be inaccurately pessimistic, despite the fact that computer chess had shown regular and predictable progress for two decades by that time. Those who forecasted this event with naive trend extrapolation (e.g. Kurzweil 1990) got almost precisely the correct answer (1997).
Looking for thoughts
I'm particularly interested in thoughts from people on the following fronts:
- What are some indicators you use to determine the reliability of forecasts by subject matter experts?
- How do you resolve the conflict of intuitions between deferring to the views of domain experts and deferring to the conclusion that forecasters have drawn about the lack of utility of domain experts' forecasts?
- In particular, what do you think of the way that "politicization" affects the reliability of forecasts?
- Also, how much value do you assign to agreement between experts when judging how much trust to place in expert forecasts?
- Comments that elaborate on these questions or this general topic within the context of a specific domain or domains would also be welcome.
This is a somewhat long and rambling post. Apologies for the length. I hope the topic and content are interesting enough for you to forgive the meandering presentation.
I blogged about the scenario planning method a while back, where I linked to many past examples of scenario planning exercises. In this post, I take a closer look at scenario analysis in the context of understanding the possibilities for the unfolding of technological progress over the next 10-15 years. Here, I will discuss some predetermined elements and critical uncertainties, offer my own scenario analysis, and then discuss scenario analyses by others.
Remember: it is not the purpose of scenario analysis to identify a set of mutually exclusive and collectively exhaustive outcomes. In fact, usually, the real-world outcome has some features from two or more of the scenarios considered, with one scenario dominating somewhat. As I noted in my earlier post:
The utility of scenario analysis is not merely in listing a scenario that will transpire, or a collection of scenarios a combination of which will transpire. The utility is in how it prepares the people undertaking the exercise for the relevant futures. One way it could so prepare them is if the early indicators of the scenarios are correctly chosen and, upon observing them, people are able to identify what scenario they're in and take the appropriate measures quickly. Another way is by identifying some features that are common to all scenarios, though the details of the feature may differ by scenario. We can therefore have higher confidence in these common features and can make plans that rely on them.
The predetermined element: the imminent demise of Moore's law "as we know it"
As Steven Schnaars noted in Megamistakes (discussed here), forecasts of technological progress in most domains have been overoptimistic, but in the domain of computing, they've been largely spot-on, mostly because the raw technology has improved quickly. The main reason has been Moore's law, and a couple other related laws, that have undergirded technological progress. But now, the party is coming to an end! The death of Moore's law (as we know it) is nigh, and there are significant implications for the future of computing.
Moore's law refers to many related claims about technological progress. Some forms of this technological progress have already stalled. Other forms are slated to stall in the near future, barring unexpected breakthroughs. These facts about Moore's law form the backdrop for all our scenario planning.
The critical uncertainty arises in how industry will respond to the prospect of the death of Moore's law. Will there be a doubling down on continued improvement at the cutting edge? Will the battle focus on cost reductions? Or will we have neither cost reduction nor technological improvement? What sort of pressure will hardware stagnation put on software?
Now, onto a description of the different versions of Moore's law (slightly edited version of information from Wikipedia):
Density at minimum cost per transistor. This is the formulation given in Moore's 1965 paper. It is not just about the density of transistors that can be achieved, but about the density of transistors at which the cost per transistor is the lowest. As more transistors are put on a chip, the cost to make each transistor decreases, but the chance that the chip will not work due to a defect increases. In 1965, Moore examined the density of transistors at which cost is minimized, and observed that, as transistors were made smaller through advances in photolithography, this number would increase at "a rate of roughly a factor of two per year".
Dennard scaling. This suggests that power requirements are proportional to area (both voltage and current being proportional to length) for transistors. Combined with Moore's law, performance per watt would grow at roughly the same rate as transistor density, doubling every 1–2 years. According to Dennard scaling transistor dimensions are scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). Finally, to keep electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%. Therefore, in every technology generation transistor density doubles, circuit becomes 40% faster, while power consumption (with twice the number of transistors) stays the same.
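The percentages in the Dennard scaling description can be checked with a few lines of arithmetic. The sketch below is my own illustration, not part of the Wikipedia text. It assumes that switching energy goes as C * V^2, with capacitance C proportional to transistor length. The function name and dictionary keys are invented for the example.

```python
def dennard_generation(s: float = 0.7) -> dict:
    """Relative change in each quantity after one generation.

    s is the linear scale factor (0.7 means dimensions shrink by 30%).
    Every returned value is a ratio to the previous generation.
    """
    area = s ** 2                         # both dimensions shrink by s
    density = 1 / area                    # transistors per unit chip area
    delay = s                             # gate delay scales with length
    frequency = 1 / delay                 # faster switching
    voltage = s                           # lowered to keep the electric field constant
    capacitance = s                       # assumption: C proportional to length
    energy = capacitance * voltage ** 2   # energy per switching event
    power_per_transistor = energy * frequency
    power_per_area = power_per_transistor * density
    return {
        "area": area,
        "density": density,
        "frequency": frequency,
        "energy": energy,
        "power_per_transistor": power_per_transistor,
        "power_per_area": power_per_area,
    }


for name, ratio in dennard_generation(0.7).items():
    print(f"{name:22s} x{ratio:.3f}")
```

With s = 0.7, area comes out at 0.49x (about half), frequency at about 1.43x, energy per switch at 0.343x (a reduction of about 65%), and power per transistor at 0.49x. Since density roughly doubles, power per unit area stays at 1.0x, which is the "stays the same" claim in the description.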
So how is each of these faring?
- Transistors per integrated circuit: At least in principle, this can continue for a decade or so. The technological ideas exist to push transistor sizes down from the current values of 32 nm and 28 nm all the way down to 7 nm.
- Density at minimum cost per transistor. This is probably stopping around now. There is good reason to believe that, barring unexpected breakthroughs, the transistor size for which we have minimum cost per transistor shall not go down below 28 nm. There may still be niche applications that benefit from smaller transistor sizes, but there will be no overwhelming economic case to switch production to smaller transistor sizes (i.e., higher densities).
- Dennard scaling. This broke down around 2005-2007. So for approximately a decade, we've essentially seen continued miniaturization but without any corresponding improvement in processor speed or performance per watt. There have been continued overall improvements in energy efficiency of computing, but not through this mechanism. The absence of automatic speed improvements has led to increased focus on using greater parallelization (note that the miniaturization means more parallel processors can be packed in the same space, so Moore's law is helping in this other way). In particular, there has been an increased focus on multicore processors, though there may be limits to how far that can take us too.
Moore's law isn't the only law that is slated to end. Other similar laws, such as Kryder's law (about the cost of hard disk space), may also end in the near future. Koomey's law on energy efficiency may also stall, or might continue to hold but through very different mechanisms compared to the ones that have driven it so far.
Some discussions that do not use explicit scenario analysis
The quotes below are to give a general idea of what people seem to generally agree on, before we delve into different scenarios.
We have been hearing about the imminent demise of Moore's Law quite a lot recently. Most of these predictions have been targeting the 7nm node and 2020 as the end-point. But we need to recognize that, in fact, 28nm is actually the last node of Moore's Law.
Summarizing all of these factors, it is clear that -- for most SoCs -- 28nm will be the node for "minimum component costs" for the coming years. As an industry, we are facing a paradigm shift because dimensional scaling is no longer the path for cost scaling. New paths need to be explored such as SOI and monolithic 3D integration. It is therefore fitting that the traditional IEEE conference on SOI has expanded its scope and renamed itself as IEEE S3S: SOI technology, 3D Integration, and Subthreshold Microelectronics.
Computer scientist Moshe Vardi writes:
So the real question is not when precisely Moore's Law will die; one can say it is already a walking dead. The real question is what happens now, when the force that has been driving our field for the past 50 years is dissipating. In fact, Moore's Law has shaped much of the modern world we see around us. A recent McKinsey study ascribed "up to 40% of the global productivity growth achieved during the last two decades to the expansion of information and communication technologies made possible by semiconductor performance and cost improvements." Indeed, the demise of Moore's Law is one reason some economists predict a "great stagnation" (see my Sept. 2013 column).
"Predictions are difficult," it is said, "especially about the future." The only safe bet is that the next 20 years will be "interesting times." On one hand, since Moore's Law will not be handing us improved performance on a silver platter, we will have to deliver performance the hard way, by improved algorithms and systems. This is a great opportunity for computing research. On the other hand, it is possible that the industry would experience technological commoditization, leading to reduced profitability. Without healthy profit margins to plow into research and development, innovation may slow down and the transition to the post-CMOS world may be long, slow, and agonizing.
However things unfold, we must accept that Moore's Law is dying, and we are heading into an uncharted territory.
"I drive a 1964 car. I also have a 2010. There's not that much difference -- gross performance indicators like top speed and miles per gallon aren't that different. It's safer, and there are a lot of creature comforts in the interior," said Nvidia Chief Scientist Bill Dally. If Moore's Law fizzles, "We'll start to look like the auto industry."
Three critical uncertainties: technological progress, demand for computing power, and interaction with software
Uncertainty #1: Technological progress
Moore's law is dead, long live Moore's law! Even if Moore's law as originally stated is no longer valid, there are other plausible computing advances that would preserve the spirit of the law.
Minor modifications of current research (as described in EETimes) include:
- Improvements in 3D circuit design (Wikipedia), so that we can stack multiple layers of circuits one on top of the other, and therefore pack more computing power per unit volume.
- Improvements in understanding electronics at the nanoscale, in particular understanding subthreshold leakage (Wikipedia) and how to tackle it.
Then, there are possibilities for totally new computing paradigms. These have fairly low probability, and are highly unlikely to become commercially viable within 10-15 years. Each of these offers an advantage over currently available general-purpose computing only for special classes of problems, generally those that are parallelizable in particular ways (the type of parallelizability needed differs somewhat between the computing paradigms).
- Quantum computing (Wikipedia) (speeds up particular types of problems). Quantum computers already exist, but the current ones can tackle only a few qubits. Currently, the best known quantum computers in action are those maintained at the Quantum AI Lab (Wikipedia) run jointly by Google, NASA, and USRA. It is currently unclear how to manufacture quantum computers with a larger number of qubits. It's also unclear how the cost will scale in the number of qubits. If the cost scales exponentially in the number of qubits, then quantum computing will offer little advantage over classical computing. Ray Kurzweil explains this as follows:
A key question is: how difficult is it to add each additional qubit? The computational power of a quantum computer grows exponentially with each added qubit, but if it turns out that adding each additional qubit makes the engineering task exponentially more difficult, we will not be gaining any leverage. (That is, the computational power of a quantum computer will be only linearly proportional to the engineering difficulty.) In general, proposed methods for adding qubits make the resulting systems significantly more delicate and susceptible to premature decoherence.
Kurzweil, Ray (2005-09-22). The Singularity Is Near: When Humans Transcend Biology (Kindle Locations 2152-2155). Penguin Group. Kindle Edition.
- DNA computing (Wikipedia)
- Other types of molecular computing (Technology Review featured story from 2000, TR story from 2010)
- Spintronics (Wikipedia): The idea is to store information using the spin of the electron, a quantum property that is binary and can be toggled at zero energy cost (in principle). The main potential utility of spintronics is in data storage, but it could potentially help with computation as well.
- Optical computing aka photonic computing (Wikipedia): This uses beams of photons that store the relevant information that needs to be manipulated. Photons promise to offer higher bandwidth than electrons, the tool used in computing today (hence the name electronic computing).
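Returning to the qubit-scaling question in the quantum computing item above: Kurzweil's point can be seen with a toy calculation. This is a sketch under assumed cost models, not a claim about real hardware. Computational power is taken to be 2^n for n qubits, as in the quote, and the two "difficulty" functions are hypothetical stand-ins for how engineering effort might scale.

```python
def computational_power(n_qubits: int) -> int:
    """Power of an n-qubit machine, taken as 2^n following the quote."""
    return 2 ** n_qubits


def difficulty_linear(n_qubits: int, unit: float = 1.0) -> float:
    """Hypothetical optimistic case: each extra qubit costs a fixed effort."""
    return unit * n_qubits


def difficulty_exponential(n_qubits: int, base: float = 2.0) -> float:
    """Hypothetical pessimistic case: each extra qubit multiplies the effort."""
    return base ** n_qubits


for n in (10, 20, 30):
    power = computational_power(n)
    print(f"n={n:2d}  "
          f"power per unit effort (linear cost) = {power / difficulty_linear(n):.3g}  "
          f"(exponential cost) = {power / difficulty_exponential(n):.3g}")
```

Under the linear cost model, power per unit of effort explodes as qubits are added. Under the exponential model with the same base, it stays constant at 1, which is the "only linearly proportional to the engineering difficulty" case in the quote.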
Uncertainty #2: Demand for computing
Even if computational advances are possible in principle, the absence of the right kind of demand can lead to a lack of financial incentive to pursue the relevant advances. I discussed the interaction between supply and demand in detail in this post.
As that post discussed, demand for computational power at the consumer end is probably reaching saturation. The main source of increased demand will now be companies that want to crunch huge amounts of data in order to more efficiently mine data for insight and offer faster search capabilities to their users. The extent to which such demand grows is uncertain. In principle, the demand is unlimited: the more data we collect (including "found data" that will expand considerably as the Internet of Things grows), the more computational power is needed to apply machine learning algorithms to the data. Since the complexity of many machine learning algorithms grows at least linearly (and in some cases quadratically or cubically) in the data, and the quantity of data itself will probably grow superlinearly, we do expect a robust increase in demand for computing.
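As a rough illustration of why data growth compounds with algorithmic complexity, here is a minimal sketch. It assumes, purely for the sake of the example, that data volume grows by a constant factor each year and that the algorithm's cost is proportional to (data size)^k. Neither the growth factors nor the function name come from the post linked above.

```python
def compute_demand_growth(data_growth_per_year: float,
                          complexity_exponent: float) -> float:
    """Yearly growth factor in compute needed when cost ~ (data size)^k.

    If data grows by a factor g per year, cost grows by g^k per year.
    """
    return data_growth_per_year ** complexity_exponent


# Hypothetical data growth factors, with linear, quadratic, and cubic algorithms.
for g in (1.5, 2.0):
    for k in (1, 2, 3):
        factor = compute_demand_growth(g, k)
        print(f"data x{g}/year, O(n^{k}) algorithm -> compute x{factor:g}/year")
```

For instance, if data volume doubles every year and the algorithm is quadratic in the data, the compute needed quadruples every year.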
Uncertainty #3: Interaction with software
Much of the increased demand for computing, as noted above, does not arise so much from a need for raw computing power by consumers, but a need for more computing power to manipulate and glean insight from large data sets. While there has been some progress with algorithms for machine learning and data mining, the fields are probably far from mature. So an alternative to hardware improvements is improvements in the underlying algorithms. In addition to the algorithms themselves, execution details (such as better use of parallel processing capabilities and more efficient use of idle processor capacity) can also yield huge performance gains.
This might be a good time to note a common belief about software and why I think it's wrong. We often tend to hear of software bloat, and some people subscribe to Wirth's law, the claim that software is getting slower more quickly than hardware is getting faster. I think that some software has gotten feature-bloated over time, largely because there are incentives to keep putting out new editions that people are willing to pay money for, and Microsoft Word might be one case of such bloat. For the most part, though, software has been getting more efficient, partly by utilizing the new hardware better, but also partly due to underlying algorithmic improvements. This was one of the conclusions of Katja Grace's report on algorithmic progress (see also this link on progress on linear algebra and linear programming algorithms). There are a few programs that get feature-bloated and as a result don't appear to improve over time as far as speed goes, but it's arguably the case that people's revealed preferences show that they are willing to put up with the lack of speed improvements as long as they're getting feature improvements.
Computing technology progress over the next 10-15 years: my three scenarios
- Slowdown to ordinary rates of growth of cutting-edge industrial productivity: For the last few decades, several dimensions of computing technology have experienced doublings over time periods ranging from six months to five years. With such fast doubling, we can expect price-performance thresholds for new categories of products to be reached every few years, with multiple new product categories a decade. Consider, for instance, desktops, then laptops, then smartphones, then tablets. If the doubling time reverts to the norm seen in other cutting-edge industrial sectors, namely 10-25 years, then we'd probably see the introduction of revolutionary new product categories only about once a generation. There are already some indications of a possible slowdown, and it remains to be seen whether we see a bounceback.
- Continued fast doubling: The other possibility is that the evidence for a slowdown is largely illusory, and computing technology will continue to experience doublings over timescales of less than five years. There would therefore be scope to introduce new product categories every few years.
- New computing paradigm with high promise, but requiring significant adjustment: This is an unlikely, but not impossible, scenario. Here, a new computing paradigm, such as quantum computing, reaches the realm of feasibility. However, the existing infrastructure of algorithms is ill-designed for quantum computing, and in fact, quantum computing endangers many security protocols while offering its own unbreakable ones. Making good use of this new paradigm requires a massive re-architecting of the world's computing infrastructure.
There are two broad features that are likely to be common to all scenarios:
- Growing importance of algorithms: Scenario (1): If technological progress in computing power stalls, then the pressure for improvements to the algorithms and software may increase. Scenario (2): If technological progress in computing power continues, that might only feed the hunger for bigger data. And as the size of data sets increases, asymptotic performance starts mattering more (the distinction between O(n) and O(n^2) matters more when n is large). In both cases, I expect more pressure on algorithms and software, but in different ways: in the case of stalling hardware progress, the focus will be more on improving the software and making minor changes to improve the constants, whereas in the case of rapid hardware progress, the focus will be more on finding algorithms that have better asymptotic (big-oh) performance. Scenario (3): In the case of paradigm shifts, the focus will be on algorithms that better exploit the new paradigm. In all cases, there will need to be some sort of shift toward new algorithms and new code that better exploits the new situation.
- Growing importance of parallelization: Although the specifics of how algorithms will become more important vary between the scenarios, one common feature is that algorithms that can better make parallel use of large numbers of machines will become more important. We have seen parallelization grow in importance over the last 15 years, even as the computing gains for individual processors through Moore's law seem to be plateauing, while data centers have proliferated in number. However, the full power of parallelization is far from tapped out. Again, parallelization matters for slightly different reasons in different cases. Scenario (1): A slowdown in technological progress would mean that gains in the amount of computation can largely be achieved by scaling up the number of machines. In other words, the usage of computing shifts further in a capital-intensive direction. Parallel computing is important for effective utilization of this capital (the computing resources). Scenario (2): Even in the face of rapid hardware progress, automatic big data generation will likely improve much faster than storage, communication, and bandwidth. This "big data" is too huge to store or even stream on a single machine, so parallel processing across huge clusters of machines becomes important. Scenario (3): Note also that almost all the new computing paradigms currently under consideration (including quantum computing) offer massive advantages for special types of parallelizable problems, so parallelization matters even in the case of a paradigm shift in computing.
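To make the asymptotic point under "Growing importance of algorithms" concrete, here is a minimal Python sketch. It assumes, purely for illustration, that an O(n) algorithm costs exactly n operations and an O(n^2) algorithm exactly n*n operations; the operation budget is a made-up number, and the function names are hypothetical. It shows how much larger a data set each kind of algorithm can handle after one doubling of available computation.

```python
import math


def max_size_linear(op_budget: int) -> int:
    """Largest n an O(n) algorithm can process, assuming a cost of exactly n operations."""
    return op_budget


def max_size_quadratic(op_budget: int) -> int:
    """Largest n an O(n^2) algorithm can process, assuming a cost of exactly n*n operations."""
    return math.isqrt(op_budget)


# Hypothetical budget of operations available per run (not a measured figure).
budget = 10**12

for label, ops in (("before doubling", budget), ("after doubling", 2 * budget)):
    print(label, "| linear:", max_size_linear(ops), "| quadratic:", max_size_quadratic(ops))
```

Under these assumptions, a doubling of computing power doubles the data size the linear algorithm can handle, but raises the quadratic algorithm's limit by only a factor of about 1.41. That is one way to see why better asymptotic behavior becomes more valuable as data sets grow.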
Other scenario analyses
McKinsey carried out a scenario analysis here, focused more on the implications for the semiconductor manufacturing industry than for users of computing. The report notes the importance of Moore's law in driving productivity improvements over the last few decades:
As a result, Moore’s law has swept much of the modern world along with it. Some estimates ascribe up to 40 percent of the global productivity growth achieved during the last two decades to the expansion of information and communication technologies made possible by semiconductor performance and cost improvements.
The scenario analysis identifies four potential sources of innovation related to Moore's law:
- More Moore (scaling)
- Wafer-size increases (maximize productivity)
- More than Moore (functional diversification)
- Beyond CMOS (new technologies)
Their scenario analysis uses a 2 X 2 model, with the two dimensions under consideration being performance improvements (continue versus stop) and cost improvements (continue versus stop). The case that both performance improvements and cost improvements continue is the "good" case for the semiconductor industry. The case that both stop is the case where the industry is highly likely to get commodified, with profit margins going down and small players catching up to the big ones. In the intermediate cases (where one of the two continues and the other stops), consolidation of the semiconductor industry is likely to continue, but there is still a risk of falling demand.
The McKinsey scenario analysis was discussed by Timothy Taylor on his blog, The Conversable Economist, here.
Roland Berger carried out a detailed scenario analysis focused on the "More than Moore" strategy here.
Blegging for missed scenarios, common features and early indicators
Are there scenarios that the analyses discussed above missed? Are there some types of scenario analysis that we didn't adequately consider? If you had to do your own scenario analysis for the future of computing technology and hardware progress over the next 10-15 years, what scenarios would you generate?
As I noted in my earlier post:
The utility of scenario analysis is not merely in listing a scenario that will transpire, or a collection of scenarios a combination of which will transpire. The utility is in how it prepares the people undertaking the exercise for the relevant futures. One way it could so prepare them is if the early indicators of the scenarios are correctly chosen and, upon observing them, people are able to identify what scenario they're in and take the appropriate measures quickly. Another way is by identifying some features that are common to all scenarios, though the details of the feature may differ by scenario. We can therefore have higher confidence in these common features and can make plans that rely on them.
I already identified some features I believe to be common to all scenarios (namely, increased focus on algorithms, and increased focus on parallelization). Do you agree with my assessment that these are likely to matter regardless of scenario? Are there other such common features you have high confidence in?
If you generally agree with one or more of the scenario analyses here (mine or McKinsey's or Roland Berger's), what early indicators would you use to identify which of the enumerated scenarios we are in? Is it possible to look at how events unfold over the next 2-3 years and draw intelligent conclusions from that about the likelihood of different scenarios?
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one.
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.
Your job, should you choose to accept it, is to comment on this thread explaining the most awesome thing you've done since June 1st. You may be as blatantly proud of yourself as you feel. You may unabashedly consider yourself the coolest freaking person ever because of that awesome thing you're dying to tell everyone about. This is the place to do just that.
Remember, however, that this isn't any kind of progress thread. Nor is it any kind of proposal thread. This thread is solely for people to talk about the awesome things they have done. Not "will do". Not "are working on". Have already done. This is to cultivate an environment of object level productivity rather than meta-productivity methods.
So, what's the coolest thing you've done this month?
Jason Mitchell is [edit: has been] the John L. Loeb Associate Professor of the Social Sciences at Harvard. He has won the National Academy of Sciences' Troland Award as well as the Association for Psychological Science's Janet Taylor Spence Award for Transformative Early Career Contribution.
Here, he argues against the principle of replicability of experiments in science. Apparently, it's disrespectful, and presumptively wrong.
Recent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value.
Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way. Unless direct replications are conducted by flawless experimenters, nothing interesting can be learned from them.
Three standard rejoinders to this critique are considered and rejected. Despite claims to the contrary, failed replications do not provide meaningful information if they closely follow original methodology; they do not necessarily identify effects that may be too small or flimsy to be worth studying; and they cannot contribute to a cumulative understanding of scientific phenomena.
Replication efforts appear to reflect strong prior expectations that published findings are not reliable, and as such, do not constitute scientific output.
The field of social psychology can be improved, but not by the publication of negative findings. Experimenters should be encouraged to restrict their “degrees of freedom,” for example, by specifying designs in advance.
Whether they mean to or not, authors and editors of failed replications are publicly impugning the scientific integrity of their colleagues. Targets of failed replications are justifiably upset, particularly given the inadequate basis for replicators’ extraordinary claims.
This is why we can't have social science. Not because the subject is not amenable to the scientific method -- it obviously is. People are conducting controlled experiments and other people are attempting to replicate the results. So far, so good. Rather, the problem is that at least one celebrated authority in the field hates that, and would prefer much, much more deference to authority.
Note: This post is part of my series of posts on forecasting, but this particular post may be of fairly limited interest to many LessWrong readers. I'm posting it here mainly for completeness. As always, I appreciate feedback.
In the course of my work looking at forecasting for MIRI, I repeatedly encountered discussions of how to communicate forecasts. A concern that emerged repeatedly was the clear communication of the uncertainty in forecasts. Nate Silver's The Signal and the Noise, in particular, focused quite a bit on the virtue of clear communication of uncertainty, in contexts as diverse as financial crises, epidemiology, weather forecasting, and climate change.
In this post, I pull together discussions from a variety of domains about the communication of uncertainty, and also include my overall impression of the findings.
Summary of overall findings
- In cases where forecasts are made and used frequently (the most salient example being temperature and precipitation forecasts) people tend to form their own models of the uncertainty surrounding forecasts, even if you present forecasts as point estimates. The models people develop are quite similar to the correct ones, but still different in important ways.
- In cases where forecasts are made more rarely, as with forecasting rare events, people are more likely to have simpler models that acknowledge some uncertainty but are less nuanced. In these cases, acknowledging uncertainty becomes quite important, because wrong forecasts of such events can lead to a loss of trust in the forecasting process, and can lead people to ignore correct forecasts later.
- In some cases, there are arguments for modestly exaggerating small probabilities to overcome specific biases that people have that cause them to ignore low-probability events.
- However, the balance of evidence suggests that forecasts should be reported as honestly as possible, and all uncertainty should be clearly acknowledged. If the forecast does not acknowledge uncertainty, people are likely to either use their own models of uncertainty, or lose faith in the forecasting process entirely if the forecast turns out to be far off from reality.
Probabilities of adverse events and the concept of the cost-loss ratio
A useful concept developed for understanding the utility of weather forecasting is the cost-loss model (Wikipedia). Consider that if a particular adverse event occurs, and we do not take precautionary measures, the loss incurred is L, whereas if we do take precautionary measures, the cost is C, regardless of whether the event occurs. An example: you're planning an outdoor party, and the adverse event in question is rain. If it rains during the event, you experience a loss of L. If you knew in advance that it would rain, you'd move the venue indoors, at a cost of C. Obviously, C < L for you to even consider the precautionary measure.
The ratio C/L is termed the cost-loss ratio and describes the probability threshold above which it makes sense to take the precautionary measure.
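The threshold follows from comparing expected expenses: taking the precaution costs C for certain, while not taking it costs L with probability p, so the precaution is worthwhile once p·L reaches C, that is, once p is at least C/L. Here is a minimal Python sketch of that rule. The function names and the dollar figures are hypothetical, and breaking ties in favor of the precaution is an arbitrary choice.

```python
def cost_loss_ratio(cost: float, loss: float) -> float:
    """Return C/L, the probability threshold for taking the precaution."""
    if cost < 0 or loss <= 0:
        raise ValueError("cost must be non-negative and loss must be positive")
    return cost / loss


def should_protect(probability: float, cost: float, loss: float) -> bool:
    """Protect when the event probability is at least C/L (ties go to protecting)."""
    return probability >= cost_loss_ratio(cost, loss)


def expected_expense(probability: float, cost: float, loss: float, protect: bool) -> float:
    """Expected expense: C if we protect, p * L if we do not."""
    return cost if protect else probability * loss


# Hypothetical outdoor party: moving indoors costs 200, a rained-out party loses 1000.
C, L = 200.0, 1000.0
for p in (0.1, 0.3):
    print(p, should_protect(p, C, L),
          expected_expense(p, C, L, protect=True),
          expected_expense(p, C, L, protect=False))
```

With these made-up numbers the threshold is 0.2, so a 30% chance of rain justifies moving indoors and a 10% chance does not.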
One way of thinking of the utility of weather forecasting, particularly in the context of forecasting adverse events (rain, snow, winds, and more extreme events) is in terms of whether people have adequate information to make correct decisions based on their cost-loss model. This would boil down to several questions:
- Is the probability of the adverse event communicated with sufficient clarity and precision that people who need to use it can plug it into their cost-loss model?
- Do people have a correct estimate of their cost-loss ratio (implicitly or explicitly)?
As I discussed in an earlier post, The Weather Channel has admitted to explicitly introducing wet bias into its probability-of-precipitation (PoP) forecasts. The rationale they offered could be interpreted as a claim that people overestimate their cost-loss ratio. For instance, a person may think his cost-loss ratio for precipitation is 0.2 (20%), but his actual cost-loss ratio may be 0.05 (5%). In this case, in order to make sure people still make the "correct" decision, PoP forecasts that fall between 0.05 and 0.2 would need to be inflated to 0.2 or higher. Note that TWC does not introduce wet bias at higher probabilities of precipitation, arguably because they believe that these are well above the cost-loss ratio for most situations.
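The logic of that example can be sketched in a few lines of Python. This is only an illustration of the hypothetical 0.05 versus 0.2 case, not a description of how The Weather Channel actually adjusts its forecasts; the function names are made up, and I assume a person takes the precaution whenever the reported probability is at least his perceived threshold.

```python
def takes_precaution(reported_pop: float, threshold: float) -> bool:
    """Assumed decision rule: act when the reported PoP is at least the threshold."""
    return reported_pop >= threshold


def inflate(true_pop: float, actual_ratio: float, perceived_ratio: float) -> float:
    """Raise forecasts lying between the actual and perceived ratios up to the perceived one.

    Forecasts outside that band are reported unchanged.
    """
    if actual_ratio <= true_pop < perceived_ratio:
        return perceived_ratio
    return true_pop


ACTUAL, PERCEIVED = 0.05, 0.2  # the hypothetical ratios from the example above

for pop in (0.03, 0.10, 0.50):
    reported = inflate(pop, ACTUAL, PERCEIVED)
    print(pop, reported, takes_precaution(reported, PERCEIVED))
```

Under these assumptions, a person who applies the perceived threshold to the inflated forecast ends up making the same choice he would have made by applying his actual threshold to the honest forecast.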
Words of estimative probability
In 1964, Sherman Kent (Wikipedia), the father of intelligence analysis, wrote an essay titled "Words of Estimative Probability" that discussed the use of words to describe probability estimates, and how different people may interpret the same word as referring to very different ranges of probability estimates. The concept of words of estimative probability (Wikipedia), along with its acronym, WEP, is now standard jargon in intelligence analysis.
Some related discussion of the use of words to convey uncertainty in estimates can be found in the part of this post where I excerpt from the paper discussing the communication of uncertainty in climate change.
Other general reading
- Nate Silver's The Signal and the Noise is worth reading in full if this topic interests you.
- The essay Communicating Uncertainty: Fulfilling the Duty to Inform by Baruch Fischhoff does a great job of reviewing the communication of uncertainty and how decision-makers can do a better job of eliciting uncertainty information from subject matter experts.
#1: The case of weather forecasting
Weather forecasting has some features that make it stand out among other forecasting domains:
- Forecasts are published explicitly and regularly: News channels and newspapers carry forecasts every day. Weather websites update their forecasts on at least an hourly basis, sometimes even faster, particularly if there are unusual weather developments. In the United States, The Weather Channel is dedicated to 24 X 7 weather news coverage.
- Forecasts are targeted at and consumed by the general public: This sets weather forecasting apart from other forms of forecasting and prediction. We can think of prices in financial markets and betting markets as implicit forecasts. But they are targeted at the niche audiences that pay attention to them, not at everybody. The mode of consumption varies. Some people just get their forecasts from the weather reports in their local TV and radio channel. Some people visit the main weather websites (such as the National Weather Service, The Weather Channel, AccuWeather, or equivalent sources in other countries). Some people have weather reports emailed to them daily. As smartphones grow in popularity, weather apps are an increasingly common way for people to keep tabs on the weather. The study on communicating weather uncertainty (discussed below) found that in the United States, people in its sample audience saw weather forecasts an average of 115 times a month. Even assuming heavy selection bias in the study, people in the developed world probably encounter a weather forecast at least once a day.
- Forecasts are used to drive decision-making: Particularly in places where weather fluctuations are significant, forecasts play an important role in event planning for individuals and organizations. At the individual level, this can include deciding whether to carry an umbrella, choosing what clothes to wear, deciding whether to wear snow boots, deciding whether conditions are suitable for driving, and many other small decisions. At the organizational level, events may be canceled or relocated based on forecasts of adverse weather. In locations with variable weather, it's considered irresponsible to plan an event without checking the weather forecast.
- People get quick feedback on whether the forecast was accurate: The next day, people know whether what was forecast transpired.
The upshot: people are exposed to weather forecasts, pay attention to them, base decisions on them, and then come to know whether the forecast was correct. This happens on a daily basis. Therefore, they have both the incentives and the information to form their own mental model of the reliability and uncertainty in forecasts. Note also that because the reliability of forecasts varies considerably by location, people who move from one location to another may take time adjusting to the new location. (For instance, when I moved to Chicago, I didn't pay much attention to weather forecasts in the beginning, but soon learned that the high variability of the weather combined with reasonable accuracy of forecasts made them worth paying attention to. Now that I'm in Berkeley, I probably pay too much attention to the forecast relative to its value, given the stability of weather in Berkeley.)
With these general thoughts in mind, let's look at the paper Communicating Uncertainty in Weather Forecasts: A Survey of the U. S. Public by Rebecca E. Morss, Julie L. Demuth, and Jeffrey K. Lazo. The paper is based on a survey of about 1500 people in the United States. The whole paper is worth a careful read if you find the issue fascinating. But for the benefit of those of you who find the issue somewhat interesting but not enough to read the paper, I include some key takeaways from the paper.
Temperature forecasts: the authors find that even though temperature forecasts are generally made as point estimates, people interpret these point estimates as temperature ranges. The temperature ranges are not even necessarily centered at the point estimates. Further, the range of temperatures increases with the forecast horizon. In other words, people (correctly) realize that forecasts made for three days later have more uncertainty attached to them than forecasts made for one day later. So people's understanding of the nature of forecast uncertainty in temperatures is correct, at least in the broad qualitative sense.
The authors believe that people arrive at these correct models through their own personal history of seeing weather forecasts and evaluating how they compare with the reality. Clearly, most people don't keep close track of how forecasts compare with the reality, but they are still able to get the general idea over several years of exposure to weather forecasts. The authors also believe that since the accuracy of weather forecasts varies by region, people's models of uncertainty may also differ by region. However, the data they collect does not allow for a test of this hypothesis. For more, read Sections 3a and 3b of the paper.
Probability-of-precipitation (PoP) forecasts: The authors also look at people's perception of probability-of-precipitation (PoP) forecasts. The correct meteorological interpretation of PoP is "the probability that precipitation occurs given these meteorological conditions." The frequentist operationalization of this would be "the fraction (situations with meteorological conditions like this where precipitation does occur)/(situations with meteorological conditions like this)." To what extent are people aware of this meaning? One of the questions in the survey elicits information on this front:
TABLE 2. Responses to Q14a, the meaning of the forecast “There is a 60% chance of rain for tomorrow” (N = 1330).

| Response | Share of respondents |
| --- | --- |
| It will rain tomorrow in 60% of the region. | 16% |
| It will rain tomorrow for 60% of the time. | 10% |
| It will rain on 60% of the days like tomorrow.* | 19% |
| 60% of weather forecasters believe that it will rain tomorrow. | 22% |
| I don’t know. | 9% |
| Other (please explain). | 24% |

* Technically correct interpretation, according to how PoP forecasts are verified, as interpreted by Gigerenzer et al. (2005).
So about 19% of participants choose the correct meteorological interpretation. However, of the 24% who offer other explanations, many suggest that they are not so much interested in the meteorological interpretation as in how this affects their decision-making. So it might be the case that even if people aren't aware of the frequentist definition, they are still using the information approximately correctly as it applies to their lives. One such application would be a comparison with the cost-loss ratio to determine whether to engage in precautionary measures. Note that, as noted earlier in the post, it may be the case that people overestimate their own cost-loss ratio, but this is a distinct problem from incorrectly interpreting the probability.
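The frequentist reading of PoP is also what makes these forecasts checkable after the fact: among all days that received a given stated probability, the fraction on which it actually rained should come out close to that probability. Here is a minimal Python sketch of such a check. The function name is hypothetical and the data are synthetic, invented only to show the mechanics; they are not taken from the paper.

```python
from collections import defaultdict


def observed_frequency_by_forecast(forecasts, outcomes):
    """Map each stated PoP to the fraction of days with that forecast on which it rained."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    counts = defaultdict(lambda: [0, 0])  # stated PoP -> [rainy days, total days]
    for pop, rained in zip(forecasts, outcomes):
        counts[pop][0] += int(rained)
        counts[pop][1] += 1
    return {pop: rainy / total for pop, (rainy, total) in sorted(counts.items())}


# Synthetic record: ten days forecast at 60% (six rainy), five days at 20% (one rainy).
forecasts = [0.6] * 10 + [0.2] * 5
outcomes = [True] * 6 + [False] * 4 + [True] * 1 + [False] * 4

print(observed_frequency_by_forecast(forecasts, outcomes))
```

In this made-up record the observed frequencies match the stated probabilities exactly, which is what a perfectly calibrated forecaster would produce in the long run. Real records would be noisier, especially for probabilities that are rarely issued.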
I also found the following resources, that I haven't had the time to read through, but that might help people interested in exploring the issue in more detail (I'll add more to this list if I find more):
- Completing the Forecast: Characterizing and Communicating Uncertainty for Better Decisions Using Weather and Climate Forecasts (2006), open book by the National Academies Press.
#2: Extreme rare events (usually weather-related) that require significant response
For some rare events (such as earthquakes) we don't know how to make specific predictions of their imminent arrival. But for others, such as hurricanes, cyclones, blizzards, tornadoes, and thunderstorms, specific probabilistic predictions can be made. Based on these predictions, significant action can be undertaken, ranging from everybody deciding to stock up on supplies and stay at home, to a mass evacuation. Such responses are quite costly, but the loss they would avert if the event did occur is even bigger. In the cost-loss framework discussed above, we are dealing with both a high cost and a loss that could be much higher. However, unlike the binary case discussed above, the loss spans more of a continuum: the amount of loss that would occur without precautionary measures depends on the intensity of the event. Similarly, the costs span a continuum: the cost depends on the extent of precautionary measures taken.
Since both the cost and loss are huge, it's quite important to get a good handle on the probability. But should the correct probability be communicated, or should it be massaged or simply converted to a "yes/no" statement? We discussed earlier the (alleged) problem of people overestimating their cost-loss ratio, and therefore not taking adequate precautionary measures, and how the Weather Channel addresses this by deliberately introducing a wet bias. But the stakes are much higher when we are talking of shutting down a city for a day or ordering a mass evacuation.
Another complication is that the rarity of the event means that people's own mental models haven't had a lot of data to calibrate the accuracy and reliability of forecasts. When it comes to temperature and precipitation forecasts, people have years of experience to rely on. They will not lose faith in a forecast based on a single occurrence. When it comes to rare events, even a few memories of incorrect forecasts, and the concomitant huge costs or huge losses, can lead people to be skeptical of the forecasts in the future. In The Signal and the Noise, Nate Silver extensively discusses the case of Hurricane Katrina and the dilemmas facing the mayor of New Orleans that led him to delay the evacuation of the city, and led many people to ignore the evacuation order even after it was announced.
A direct strike of a major hurricane on New Orleans had long been every weather forecaster’s worst nightmare. The city presented a perfect set of circumstances that might contribute to the death and destruction there. [...]
The National Hurricane Center nailed its forecast of Katrina; it anticipated a potential hit on the city almost five days before the levees were breached, and concluded that some version of the nightmare scenario was probable more than forty-eight hours away. Twenty or thirty years ago, this much advance warning would almost certainly not have been possible, and fewer people would have been evacuated. The Hurricane Center’s forecast, and the steady advances made in weather forecasting over the past few decades, undoubtedly saved many lives.
Not everyone listened to the forecast, however. About 80,000 New Orleanians —almost a fifth of the city’s population at the time— failed to evacuate the city, and 1,600 of them died. Surveys of the survivors found that about two-thirds of them did not think the storm would be as bad as it was. Others had been confused by a bungled evacuation order; the city’s mayor, Ray Nagin, waited almost twenty-four hours to call for a mandatory evacuation, despite pleas from Mayfield and from other public officials. Still other residents— impoverished, elderly, or disconnected from the news— could not have fled even if they had wanted to.
Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 109-110). Penguin Group US. Kindle Edition.
So what went wrong? Silver returns to this later in the chapter:
As Max Mayfield told Congress, he had been prepared for a storm like Katrina to hit New Orleans for most of his sixty-year life. Mayfield grew up around severe weather— in Oklahoma, the heart of Tornado Alley— and began his forecasting career in the Air Force, where people took risk very seriously and drew up battle plans to prepare for it. What took him longer to learn was how difficult it would be for the National Hurricane Center to communicate its forecasts to the general public.
“After Hurricane Hugo in 1989,” Mayfield recalled in his Oklahoma drawl, “I was talking to a behavioral scientist from Florida State. He said people don’t respond to hurricane warnings. And I was insulted. Of course they do. But I have learned that he is absolutely right. People don’t respond just to the phrase ‘hurricane warning.’ People respond to what they hear from local officials. You don’t want the forecaster or the TV anchor making decisions on when to open shelters or when to reverse lanes.”
Under Mayfield’s guidance, the National Hurricane Center began to pay much more attention to how it presented its forecasts. In contrast to most government agencies, whose Web sites look as though they haven’t been updated since the days when you got those free AOL CDs in the mail, the Hurricane Center takes great care in the design of its products, producing a series of colorful and attractive charts that convey information intuitively and accurately on everything from wind speed to storm surge.
The Hurricane Center also takes care in how it presents the uncertainty in its forecasts. “Uncertainty is the fundamental component of weather prediction,” Mayfield said. “No forecast is complete without some description of that uncertainty.” Instead of just showing a single track line for a hurricane’s predicted path, for instance, their charts prominently feature a cone of uncertainty—“ some people call it a cone of chaos,” Mayfield said. This shows the range of places where the eye of the hurricane is most likely to make landfall. Mayfield worries that even this isn’t enough. Significant impacts like flash floods (which are often more deadly than the storm itself) can occur far from the center of the storm and long after peak wind speeds have died down. No people in New York City died from Hurricane Irene in 2011 despite massive media hype surrounding the storm, but three people did from flooding in landlocked Vermont once the TV cameras were turned off.
Mayfield told Nagin that he needed to issue a mandatory evacuation order, and to do so as soon as possible.
Nagin dallied, issuing a voluntary evacuation order instead. In the Big Easy, that was code for “take it easy”; only a mandatory evacuation order would convey the full force of the threat. Most New Orleanians had not been alive when the last catastrophic storm, Hurricane Betsy, had hit the city in 1965. And those who had been, by definition, had survived it. “If I survived Hurricane Betsy, I can survive that one, too. We all ride the hurricanes, you know,” an elderly resident who stayed in the city later told public officials. Responses like these were typical. Studies from Katrina and other storms have found that having survived a hurricane makes one less likely to evacuate the next time one comes.
The reasons for Nagin’s delay in issuing the evacuation order is a matter of some dispute— he may have been concerned that hotel owners might sue the city if their business was disrupted. Either way, he did not call for a mandatory evacuation until Sunday at 11 A.M. —and by that point the residents who had not gotten the message yet were thoroughly confused . One study found that about a third of residents who declined to evacuate the city had not heard the evacuation order at all. Another third heard it but said it did not give clear instructions. Surveys of disaster victims are not always reliable— it is difficult for people to articulate why they behaved the way they did under significant emotional strain, and a small percentage of the population will say they never heard an evacuation order even when it is issued early and often. But in this case, Nagin was responsible for much of the confusion.
There is, of course, plenty of blame to go around for Katrina— certainly to FEMA in addition to Nagin. There is also credit to apportion— most people did evacuate, in part because of the Hurricane Center’s accurate forecast. Had Betsy topped the levees in 1965, before reliable hurricane forecasts were possible, the death toll would probably have been even greater than it was in Katrina. One lesson from Katrina, however, is that accuracy is the best policy for a forecaster. It is forecasting’s original sin to put politics, personal glory, or economic benefit before the truth of the forecast. Sometimes it is done with good intentions, but it always makes the forecast worse. The Hurricane Center works as hard as it can to avoid letting these things compromise its forecasts. It may not be a coincidence that, in contrast to all the forecasting failures in this book, theirs have become 350 percent more accurate in the past twenty-five years alone.
“The role of a forecaster is to produce the best forecast possible,” Mayfield says. It’s so simple— and yet forecasters in so many fields routinely get it wrong.
Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 138-141). Penguin Group US. Kindle Edition.
Silver notes similar failures of communication of forecast uncertainty in other domains, including exaggeration of the 1976 swine flu outbreak.
I also found a few related papers that may be worth reading if you're interested in understanding the communication of weather-related rare event forecasts:
- Communicating forecast uncertainty in hydro-meteorological forecasts by Maria-Helena Ramos, Thibault Mathevet, Jutta Thielen, and Florian Pappenberger.
- Communicating Risk and Uncertainty: Science, Technology, and Disasters at the Crossroads by Havidán Rodríguez, Walter Díaz, Jenniffer M. Santos, and Benigno E. Aguirre.
#3: Long-run changes that might necessitate policy responses or long-term mitigation or adaptation strategies, such as climate change
In marked contrast to daily weather forecasting as well as extreme rare event forecasting is the forecasting of gradual long-term structural changes. Examples include climate change, economic growth, changes in the size and composition of the population, and technological progress. Here, the general recommendation is clear and detailed communication of uncertainty using multiple formats, with the format tailored to the types of decisions that will be based on the information.
On the subject of communicating uncertainty in climate change, I found the paper Communicating uncertainty: lessons learned and suggestions for climate change assessment by Anthony Patt and Suraje Dessai. The paper is quite interesting (and has been referenced by some of the other papers mentioned in this post).
The paper identifies three general sources of uncertainty:
- Epistemic uncertainty arises from incomplete knowledge of processes that influence events.
- Natural stochastic uncertainty refers to the chaotic nature of the underlying system (in this case, the climate system).
- Human reflexive uncertainty refers to uncertainty in human activity that could affect the system. Some of the activity may be undertaken specifically in response to the forecast.
This is somewhat similar to, but not directly mappable to, the classification of sources of uncertainty by Gavin Schmidt from NASA that I discussed in my post on weather and climate forecasting:
- Initial condition uncertainty: This form of uncertainty dominates short-term weather forecasts (though not necessarily the very short term weather forecasts; it seems to matter the most for intervals where numerical weather prediction gets too uncertain but long-run equilibrating factors haven't kicked in). Over timescales of several years, this form of uncertainty is not influential.
- Scenario uncertainty: This is uncertainty that arises from lack of knowledge of how some variable (such as carbon dioxide levels in the atmosphere, or levels of solar radiation, or aerosol levels in the atmosphere, or land use patterns) will change over time. Scenario uncertainty rises over time, i.e., scenario uncertainty plagues long-run climate forecasts far more than it plagues short-run climate forecasts.
- Structural uncertainty: This is uncertainty that is inherent to the climate models themselves. Structural uncertainty is problematic at all time scales to a roughly similar degree (some forms of structural uncertainty affect the short run more whereas some affect the long run more).
Section 2 of the paper has a general discussion of interpreting and communicating probabilities. One of the general points made is that the more extreme the event, the lower people's mental probability threshold for verbal descriptions of likelihood. For instance, for a serious disease, the probability threshold for "very likely" may be 30%, whereas for a minor ailment, it may be 90% (these numbers are my own, not from the paper). The authors also discuss the distinction between frequentist and Bayesian approaches and claim that the frequentist approach is better suited to assimilating multiple pieces of information, and therefore, frequentist framings should be preferred to Bayesian framings when communicating uncertainty:
As should already be evident, whether the task of estimating and responding to uncertainty is framed in stochastic (usually frequentist) or epistemic (often Bayesian) terms can strongly influence which heuristics people use, and likewise lead to different choice outcomes . Framing in frequentist terms on the one hand promotes the availability heuristic, and on the other hand promotes the simple acts of multiplying, dividing, and counting. Framing in Bayesian terms, by contrast, promotes the representativeness heuristic, which is not well adapted to combining multiple pieces of information. In one experiment, people were given the problem of estimating the chances that a person has a rare disease, given a positive result from a test that sometimes generates false positives. When people were given the problem framed in terms of a single patient receiving the diagnostic test, and the base probabilities of the disease (e.g., 0.001) and the reliability of the test (e.g., 0.95), they significantly over-estimate the chances that the person has the disease (e.g., saying there is a 95% chance). But when people were given the same problem framed in terms of one thousand patients being tested, and the same probabilities for the disease and the test reliability, they resorted to counting patients, and typically arrived at the correct answer (in this case, about 2%). It has, indeed, been speculated that the gross errors at probability estimation, and indeed errors of logic, observed in the literature take place primarily when people are operating within the Bayesian probability framework, and that these disappear when people evaluate problems in frequentist terms [23,58].
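The arithmetic behind the diagnostic-test example in the quote can be checked directly. The sketch below computes the answer both ways: once with Bayes' rule for a single patient, and once by counting patients in a group of one thousand. It assumes that "reliability 0.95" means a 95% true-positive rate together with a 5% false-positive rate; the quote does not spell this out, and the function names are my own.

```python
def posterior_single_patient(base_rate, sensitivity, false_positive_rate):
    """Bayes' rule: P(disease | positive test) for one patient."""
    true_pos = base_rate * sensitivity
    false_pos = (1.0 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)


def posterior_by_counting(population, base_rate, sensitivity, false_positive_rate):
    """The same question in frequency terms: count who tests positive."""
    sick = population * base_rate
    healthy = population - sick
    sick_positive = sick * sensitivity
    healthy_positive = healthy * false_positive_rate
    return sick_positive / (sick_positive + healthy_positive)


# Assumed reading of the quote: base rate 0.001, sensitivity 0.95,
# false-positive rate 0.05 (= 1 - 0.95).
p_bayes = posterior_single_patient(0.001, 0.95, 0.05)
p_count = posterior_by_counting(1000, 0.001, 0.95, 0.05)
print(f"single-patient framing: {p_bayes:.4f}")
print(f"1000-patient framing:   {p_count:.4f}")
```

Under these assumptions both framings give just under 2%: roughly one true positive among about fifty-one positive tests per thousand patients. This matches the "about 2%" in the quote and is far from the intuitive 95%.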
The authors offer the following suggestions in the discussion section (Section 4) of their paper:
The challenge of communicating probabilistic information so that it will be used, and used appropriately, by decision-makers has been long recognized. [...] In some cases, the heuristics that people use are not well suited to the particular problem that they are solving or decision that they are making; this is especially likely for types of problems outside their normal experience. In such cases, the onus is on the communicators of the probabilistic information to help people find better ways of using the information, in such a manner that respects the users’ autonomy, full set of concerns and goals, and cognitive perspective.
That these difficulties appear to be most pronounced when dealing with predictions of one-time events, where the probability estimates result from a lack of complete confidence in the predictive models. When people speak about such epistemic or structural uncertainty, they are far more likely to shun quantitative descriptions, and are far less likely to combine separate pieces of information in ways that are mathematically correct. Moreover, people perceive decisions that involve structural uncertainty as riskier, and will take decisions that are more risk averse. By contrast, when uncertainty results from well-understood stochastic processes, for which the probability estimate results from counting of relative frequencies, people are more likely to work effectively with multiple pieces of information, and to take decisions that are more risk neutral.
In many ways, the most recent approach of the IPCC WGI responds to these issues. Most of the uncertainties with respect to climate change science are in fact epistemic or structural, and the probability estimates of experts reflect degrees of confidence in the occurrence of one-time events, rather than measurement of relative frequencies in relevant data sets. Using probability language, rather than numerical ranges, matches people’s cognitive framework, and will likely make the information both easier to understand, and more likely to be used. Moreover, defining the words in terms of specific numerical ranges ensures consistency within the report, and does allow comparison of multiple events, for which the uncertainty may derive from different sources.
We have already mentioned the importance of target audiences in communicating uncertainties, but this cannot be emphasized enough. The IPCC reports have a wide readership so a pluralistic approach is necessary. For example, because of its degree of sophistication, the water chapter could communicate uncertainties using numbers, whereas the regional chapters might use words and the adaptive capacity chapter could use narratives. “Careful design of communication and reporting should be done in order to avoid information divide, misunderstandings, and misinterpretations. The communication of uncertainty should be understandable by the audience. There should be clear guidelines to facilitate clear and consistent use of terms provided. Values should be made explicit in the reporting process” .
However, by writing the assessment in terms of people’s intuitive framework, the IPCC authors need to understand that this intuitive framework carries with it several predictable biases. [...]
The literature suggests, and the two experiments discussed here further confirm, that the approach of the IPCC leaves room for improvement. Further, as the literature suggests, there is no single solution for these potential problems, but there are communication practices that could help. [...]
Finally, the use of probability language, instead of numbers, addresses only some of the challenges in uncertainty communication that have been identified in the modern decision support literature. Most importantly, it is important in the communication process to address how the information can and should be used, using heuristics that are appropriate for the particular decisions. [...] Obviously, there are limits to the length of the report, but within the balancing act of conciseness and clarity, greater attention to full dimensions of uncertainty could likely increase the chances that users will decide to take action on the basis of the new information.
I just started reading Total Freedom by Chris Sciabarra (warning: politics book), and a good half of it seems to be about 'dialectics' as a thinking tool, but it's been total rubbish in trying to explain it. From poking around on the internet, it seems to have been a proto-systems theory that became a Marxist shibboleth.
Am I understanding that correctly? The LW survey says about 1 in 4 of us is a communist, so I'm hoping someone can point me to resources or something. Also, I've read through most of the sequences, and they didn't use the word dialectics at all, which seems strange if it's such a useful thinking tool. Is there something wrong with it as an epistemological practice? Is the word just outdated?
Sorry about the (tangentially) political post, I'm just kind of confused. Help?
In an earlier post, I looked at some general domains of forecasting. This post looks at some more specific classes of forecasting, some of which overlap with the general domains, and some of which are more isolated. The common thread to these classes of forecasting is that they involve rare events.
Different types of forecasting for rare events
When it comes to rare events, there are three different classes of forecasts:
- Point-in-time-independent probabilistic forecasts: Forecasts that provide a probability estimate for the event occurring in a given timeframe, but with no distinction based on the point in time. In other words, the forecast may say "there is a 5% chance of an earthquake higher than 7 on the Richter scale in this geographical region in a year" but the forecast is not sensitive to the choice of year. These are sufficient to inform decisions on general preparedness. In the case of earthquakes, for instance, the amount of care to be taken in building structures can be determined based on these forecasts. On the other hand, they are useless for deciding the timing of specific activities.
- Point-in-time-dependent probabilistic forecasts: Forecasts that provide a probability estimate that varies somewhat over time based on history, but aren't precise enough for a remedial measure that substantially offsets major losses. For instance, if I know that an earthquake will occur in San Francisco in the next 6 months with probability 90%, it's still not actionable enough for a mass evacuation of San Francisco. But some preparatory measures may be undertaken.
- Predictions made with high confidence (i.e., a high estimated probability when the event is predicted) and a specific time, location, and characteristics: Precise predictions of date and time, sufficient for remedial measures that substantially offset major losses (but possibly at huge, if much smaller, cost). The situation with hurricanes, tornadoes, and blizzards is roughly in this category.
Statistical distributions: normal distributions versus power law distributions
Perhaps the most ubiquitous distribution used in probability and statistics is the normal distribution. The normal distribution is a symmetric distribution whose probability density function decays superexponentially with distance from the mean (more precisely, it is exponential decay in the square of the distance). In other words, the probability decays slowly at the beginning, and faster later. Thus, for instance, the ratio of pdfs for 2 standard deviations from the mean and 1 standard deviation from the mean is greater than the ratio of pdfs for 3 standard deviations from the mean and 2 standard deviations from the mean. To give explicit numbers: about 68.2% of the distribution lies between -1 and +1 SD, 95.4% lies between -2 and +2 SD, 99.7% lies between -3 and +3 SD, and 99.99% lies between -4 and +4 SD. So the probability of being more than 4 standard deviations from the mean is less than 1 in 10000.
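The coverage figures and the pdf-ratio claim above can be verified with the standard library alone. This is a minimal sketch; the function names are mine.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def coverage_within(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3, 4):
    inside = coverage_within(k)
    print(f"within {k} SD: {inside:.5%}   outside: {1.0 - inside:.2e}")

# Superexponential decay: each further step from the mean costs more.
ratio_2_to_1 = normal_pdf(2) / normal_pdf(1)  # equals exp(-1.5)
ratio_3_to_2 = normal_pdf(3) / normal_pdf(2)  # equals exp(-2.5)
print(f"pdf(2)/pdf(1) = {ratio_2_to_1:.3f}, pdf(3)/pdf(2) = {ratio_3_to_2:.3f}")
```

The first ratio is larger than the second, which is the sense in which the decay speeds up as you move away from the mean.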
If the probability distribution for intensity looks (roughly) like a normal distribution, then high-intensity events are extremely unlikely. So, if the probability distribution for intensity is normal, we do not have to worry about high-intensity events much.
The situations where rare event forecasting becomes more important are those where events that are high-intensity, or "extreme" in some sense, occur rarely but not as rarely as in a normal distribution. We say that the tails of such distributions are thicker than those of the normal distribution, and the distributions are termed "thick-tailed" or "fat-tailed" distributions. [Formally, the thickness of tails is measured using a quantity called excess kurtosis, which sees how the fourth central moment compares with the square of the second central moment (the second central moment is the variance, and it is the square of the standard deviation), then subtracts off the number 3, which is the corresponding value for the normal distribution. If the excess kurtosis for a distribution is positive, it is a thick-tailed distribution.]
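To make the bracketed definition concrete, here is a small sketch that computes excess kurtosis from a list of values using plain moments (dividing by n). The two data sets are toy examples invented for illustration, not real measurements.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0


# Toy data: no extremes at all (every value is exactly 1 away from the mean).
thin = [-1.0, 1.0] * 50
# Toy data: mostly nothing happens, plus two rare, very large deviations.
heavy = [0.0] * 98 + [-10.0, 10.0]

print("thin-tailed toy sample: ", excess_kurtosis(thin))
print("thick-tailed toy sample:", excess_kurtosis(heavy))
```

The first sample comes out negative (thinner tails than the normal distribution) and the second strongly positive, because a handful of extreme values dominate the fourth moment.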
The most common example of such distributions that is of interest to us is power law distributions. Here, the probability is proportional to a negative power of the intensity, so the decay is like that of a power function. If you remember some basic precalculus/calculus, you'll recall that power functions (such as the square function or cube function) grow more slowly than exponential functions. So power law distributions decay subexponentially: they decay more slowly than exponential decay (to be more precise, the decay starts off fast, then slows down). As noted above, the pdf for the normal distribution decays exponentially in the square of the distance from the mean, so the upshot is that power law distributions decay more slowly than normal distributions.
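A quick numerical comparison shows how different the two kinds of decay are. The sketch below compares the tail probability of a standard normal with that of a Pareto-type power law. The exponent `alpha = 2.0` and cutoff `x_min = 1.0` are arbitrary choices for illustration, and the two distributions are not matched in scale, so only the rate of decay should be compared, not the absolute numbers.

```python
import math


def normal_tail(x):
    """P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def power_law_tail(x, alpha=2.0, x_min=1.0):
    """P(X > x) when the tail follows (x / x_min) ** -alpha (illustrative)."""
    if x < x_min:
        return 1.0
    return (x / x_min) ** (-alpha)


for x in (2, 4, 8, 16):
    print(f"x = {x:2d}   normal: {normal_tail(x):.3e}   power law: {power_law_tail(x):.3e}")

# Doubling x always costs the power law the same factor (here 1/4),
# while the normal tail shrinks by an ever larger factor.
power_step = power_law_tail(16) / power_law_tail(8)
normal_step = normal_tail(16) / normal_tail(8)
```

For the power law, every doubling of intensity divides the tail probability by the same constant. For the normal distribution, each doubling is vastly more costly than the last, which is why extreme events are effectively ruled out there.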
For most of the rare event classes we discuss, to the extent that it has been possible to pin down a distribution, it has looked a lot more like a power law distribution than a normal distribution. Thus, rare events need to be heeded. (There's obviously a selection effect here: for those cases where the distributions are close to normal, forecasting rare events just isn't that challenging, so they wouldn't be included in my post).
UPDATE: Aaron Clauset, who appears in #4, pointed me (via email) to his Rare Events page, containing the code (Matlab and Python) that he used in his terrorism statistics paper mentioned as an update at the bottom of #4. He noted in the email that the statistical methods are fairly general, so interested readers could use the code to cross-apply the methods to rare events in other domains.
One of the more famous advocates of the idea that people overestimate the ubiquity of normal distributions and underestimate the prevalence of power law distributions is Nassim Nicholas Taleb. Taleb calls the world of normal distributions Mediocristan (the world of mediocrity, where things are mostly ordinary and weird things are very very rare) and the world of power law distributions Extremistan (the world of extremes, where rare and weird events are more common). Taleb has elaborated on this thesis in his book The Black Swan, though some parts of the idea are also found in his earlier book Fooled by Randomness.
I'm aware that a lot of people swear by Taleb, but I personally don't find his writing very impressive. He does cover a lot of important ideas but they didn't originate with him, and he goes off on a lot of tangents. In contrast, I found Nate Silver's The Signal and the Noise a pretty good read, and although it wasn't focused on rare events per se, I drew on the parts of it that did discuss such forecasting for this post.
(Sidenote: My criticism of Taleb is broadly similar to that offered by Jamie Whyte here in Standpoint Magazine. Also, here's a review by Steve Sailer of Taleb. Sailer is much more favorably inclined to the normal distribution than Taleb is, and this is probably related to his desire to promote IQ distributions/The Bell Curve type ideas, but I think many of Sailer's criticisms are spot on).
Examples of rare event classes that we discuss in this post
The classes discussed in this post include:
- Earthquakes: Category #1. Also hypothesized to follow a power law distribution.
- Volcanoes: Category #2.
- Extreme weather events (hurricanes/cyclones, tornadoes, blizzards): Category #3.
- Major terrorist acts: Questionable, at least Category #1, some argue it is Category #2 or Category #3. Hypothesized to follow a power law distribution.
- Power outages (could be caused by any of 1-4, typically 3)
- Server outages (could be caused by 5)
- Financial crises
- Global pandemics, such as the 1918 flu pandemic (popularly called the "Spanish flu") that, according to Wikipedia, "infected 500 million people across the world, including remote Pacific islands and the Arctic, and killed 50 to 100 million of them—three to five percent of the world's population." They probably fall under Category #2, but I couldn't get a clear picture. (Pandemics were not in the list at the time of original publication of the post; I added them based on a comment suggestion).
- Near-earth object impacts (not in the list at the time of original publication of the post; I added them based on a comment suggestion).
Other examples of rare events would also be appreciated.
Earthquake prediction remains mostly in category 1: there are probability estimates of the occurrence of earthquakes of a given severity or higher within a given timeframe, but these estimates do not distinguish between different points in time. In The Signal and the Noise, statistician and forecasting expert Nate Silver talks to Susan Hough (Wikipedia) of the United States Geological Survey and describes what she has to say about the current state of earthquake forecasting:
What seismologists are really interested in— what Susan Hough calls the “Holy Grail” of seismology— are time-dependent forecasts, those in which the probability of an earthquake is not assumed to be constant across time.
Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (p. 154). Penguin Group US. Kindle Edition.
The whole Silver chapter is worth reading, as is the Wikipedia page on earthquake prediction, which covers much of the same ground.
In fact, even for time-independent earthquake forecasting, currently the best known forecasting method is the extremely simple Gutenberg-Richter law, which says that for a given location, the frequency of earthquakes obeys a power law with respect to intensity. Since the Richter scale is logarithmic (to base 10), this means that adding a point on the Richter scale makes the frequency of earthquakes decrease to a fixed fraction of the previous value. Note that the Gutenberg-Richter law can't be the full story: there are probably absolute limits on the intensity of the earthquake (some people believe that an earthquake of intensity 10 or higher is impossible). But so far, it seems to have the best track record.
Why haven't we been able to come up with better models? This relates to the problem of overfitting common in machine learning and statistics: when the data points are few and noisy, trying a more complicated law (with more freely varying parameters) ends up fitting the noise in the data rather than the signal, and therefore ends up being a poor fit for new, out-of-sample data. The problem is dealt with in statistics using various goodness of fit tests and measures such as the Akaike information criterion, and it's dealt with in machine learning using a range of techniques such as cross-validation, regularization, and early stopping. These approaches can generally work well in situations where there is lots of data and lots of parameters. But in cases where there is very little data, it often makes sense to just manually select a simple model. The Gutenberg-Richter law has two parameters, and can be fit using a simple linear regression. There isn't enough information to reliably fit even modestly more complicated models, such as the characteristic earthquake models, and past attempts based on characteristic earthquakes failed in both directions (a predicted earthquake at Parkfield never materialized, and the probability of the 2011 Japan earthquake was underestimated by the model relative to the Gutenberg-Richter law).
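To illustrate how little machinery the Gutenberg-Richter law needs, here is a minimal sketch of the two-parameter fit. It uses the usual form log10(N) = a - b*M, where N is the annual number of earthquakes of magnitude at least M, and fits a and b by ordinary least squares. The counts below are invented for illustration and are not real seismic data; the function names are mine.

```python
import math


def fit_gutenberg_richter(magnitudes, annual_counts):
    """Least-squares fit of log10(N) = a - b * M. Returns (a, b)."""
    ys = [math.log10(c) for c in annual_counts]
    n = len(magnitudes)
    mean_x = sum(magnitudes) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in magnitudes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(magnitudes, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope


def annual_rate(a, b, magnitude):
    """Predicted annual number of earthquakes of at least this magnitude."""
    return 10 ** (a - b * magnitude)


# Hypothetical region: noisy counts of quakes at or above each magnitude.
magnitudes = [4.0, 4.5, 5.0, 5.5, 6.0]
annual_counts = [10.0, 3.3, 0.95, 0.34, 0.1]

a, b = fit_gutenberg_richter(magnitudes, annual_counts)
rate_m8 = annual_rate(a, b, 8.0)
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"extrapolated annual rate of magnitude >= 8: {rate_m8:.4f}")
```

The last line extrapolates from frequent small earthquakes to a magnitude that never appears in the (made-up) record. The law is useful for exactly that reason, and the extrapolation is only as trustworthy as the assumption that the same straight line continues to hold.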
Silver's chapter and other sources do describe some possibilities for short-term forecasting based on foreshocks and aftershocks, and seismic disturbances, but note considerable uncertainty.
The existence of time-independent forecasts for earthquakes has probably had major humanitarian benefits. Building codes and standards, in particular, can adapt to the probability of earthquakes. For instance, building standards are stricter in the San Francisco Bay Area than in other parts of the United States, partly because of the greater probability of earthquakes. Note also that Gutenberg-Richter does make out-of-sample predictions: it can use the frequency of low-intensity earthquakes to predict the frequency of high-intensity earthquakes, and therefore obtain a time-independent forecast of such an earthquake in a region that may never have experienced it.
#2: Volcanic eruptions
Volcanoes are an easier case than earthquakes. Silver's book doesn't discuss them, but the Wikipedia article offers basic information. A few points:
- Volcanic activity falls close to category #2: time-dependent forecasts can be made, albeit with considerable uncertainty.
- Volcanic activity poses less immediate risk because fewer people live close to the regions where volcanoes typically erupt.
- However, volcanic activity can affect regional and global climate for a few years (in the cooling direction), and might even shift the intercept of other long-term secular and cyclic trends in climate (the reason is that the dust particles released by volcanoes into the atmosphere reduce the extent to which solar radiation is absorbed). For instance, the 1991 Mount Pinatubo eruption is credited with causing the next 1-2 years to be cooler than they otherwise would have been, masking the heating effect of a strong El Nino.
#3: Extreme weather events (lightning, hurricanes/cyclones, blizzards, tornadoes)
Forecasting for lightning and thunderstorms has improved quite a bit over the last century, and falls squarely within Category #3. In The Signal and the Noise, Nate Silver notes that the probability of an American dying from lightning has dropped from 1 in 400,000 in 1940 to 1 in 11,000,000 today, and a large part of the credit goes to better weather forecasting causing people to avoid the outdoors at the times and places that lightning might strike.
Forecasting for hurricanes and cyclones (which are the same weather phenomenon, just at different latitudes) is quite good, and getting better. It falls squarely in category #3: in addition to having general probability estimates of the likelihood of particular types of extreme weather events, we can forecast them a day or a few days in advance, allowing for preparation and minimization of negative impact.
The precision for forecasting the eye of the storm has increased about 3.5-fold in length terms (so about 12-fold in area terms) over the last 25 years. Nate Silver notes that 25 years ago, the National Hurricane Center's forecasts for where a hurricane would hit on landfall, made three days in advance, were 350 miles off on average. Now they're about 100 miles off on average. Most of the major hurricanes that hit the United States, and many other parts of the world, were forecast well in advance, and people even made preparations (for instance, by declaring holidays, or stocking up on goods). Blizzard forecasting is also fairly impressive: I was in Chicago in 2011 when a blizzard hit, and it had been forecast at least a day in advance. With tornadoes, warning alerts are often issued, although the tornado often doesn't actually touch down even after the alert is issued (fortunately for us).
#4: Major terrorist acts
Terrorist attacks are interesting. It has been claimed that the frequency-damage relationship for terrorist attacks follows a power law. The academic paper that popularized this observation is a paper by Aaron Clauset, Maxwell Young and Kristian Gleditsch titled "On the Frequency of Severe Terrorist Attacks" (Journal of Conflict Resolution 51(1), 58 - 88 (2007)), here. Bruce Schneier wrote a blog post about a later paper by Clauset and Frederick W. Wiegel, and see also more discussion here, here, here, and here (I didn't select these links through a very discerning process; I just picked the top results of a Google Search).
Silver's book does allude to power laws for terrorism, and says the following about Clauset (I initially couldn't find any reference to Clauset in the book, but it seems my Kindle search was buggy):
Clauset’s insight, however, is actually quite simple— or at least it seems that way with the benefit of hindsight. What his work found is that the mathematics of terrorism resemble those of another domain discussed in this book: earthquakes.
Imagine that you live in a seismically active area like California. Over a period of a couple of decades, you experience magnitude 4 earthquakes on a regular basis, magnitude 5 earthquakes perhaps a few times a year, and a handful of magnitude 6s. If you have a house that can withstand a magnitude 6 earthquake but not a magnitude 7, would it be right to conclude that you have nothing to worry about?
Of course not. According to the power-law distribution that these earthquakes obey, those magnitude 5s and magnitude 6s would have been a sign that larger earthquakes were possible—inevitable, in fact, given enough time. The big one is coming, eventually. You ought to have been prepared.
Terror attacks behave in something of the same way. The Lockerbie bombing and Oklahoma City were the equivalent of magnitude 7 earthquakes. While destructive enough on their own, they also implied the potential for something much worse— something like the September 11 attacks, which might be thought of as a magnitude 8. It was not an outlier but instead part of the broader mathematical pattern.
Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 427-428). Penguin Group US. Kindle Edition.
So terrorist attacks are at least in category 1. What about categories 2 and 3? Can we forecast terrorist attacks the way we can forecast volcanoes, or the way we can forecast hurricanes? One difference between terrorist acts and the "acts of God" discussed so far is that to the extent one has inside information about a terrorist attack that's good enough to predict it with high accuracy, it's usually also sufficient to actually prevent the terrorist attack. So Category 3 becomes trickier to define. Should we count the numerous foiled terrorist plots as evidence that terrorist acts can be successfully "predicted" or should we only consider successful terrorist acts in the denominator? And another complication is that terrorist acts are responsive to geopolitical decisions in ways that earthquakes are definitely not, with extreme weather events falling somewhere in between.
As for Category 2, the evidence is unclear, but it's highly likely that terrorist acts can be forecast in a time-dependent fashion to quite a degree. If you want to crunch the numbers yourself, the Global Terrorism Database (website, Wikipedia) and Suicide Attack Database (website, Wikipedia) are available for you to use. I discussed some general issues with political and conflict forecasting in my earlier post on the subject.
UPDATE: Clauset emailed me with some corrections to this section of the post, which I have made. He also pointed me to a recent paper he co-wrote with Ryan Woodward about estimating the historical and future probabilities of terror events, available on the arXiv. Here's the abstract:
Quantities with right-skewed distributions are ubiquitous in complex social systems, including political conflict, economics and social networks, and these systems sometimes produce extremely large events. For instance, the 9/11 terrorist events produced nearly 3000 fatalities, nearly six times more than the next largest event. But, was this enormous loss of life statistically unlikely given modern terrorism's historical record? Accurately estimating the probability of such an event is complicated by the large fluctuations in the empirical distribution's upper tail. We present a generic statistical algorithm for making such estimates, which combines semi-parametric models of tail behavior and a nonparametric bootstrap. Applied to a global database of terrorist events, we estimate the worldwide historical probability of observing at least one 9/11-sized or larger event since 1968 to be 11-35%. These results are robust to conditioning on global variations in economic development, domestic versus international events, the type of weapon used and a truncated history that stops at 1998. We then use this procedure to make a data-driven statistical forecast of at least one similar event over the next decade.
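To give a feel for what "a model of tail behavior plus a nonparametric bootstrap" can look like, here is a heavily simplified sketch. It is not the algorithm from the paper: it assumes a pure continuous power-law tail with a known cutoff `x_min`, fits the exponent by maximum likelihood, and bootstraps the per-event probability of exceeding a threshold. The data are synthetic, and every name and parameter value is my own choice for illustration.

```python
import math
import random


def sample_power_law(rng, n, alpha, x_min):
    """Inverse-transform samples from a density proportional to x ** -alpha."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(data, x_min):
    """Maximum-likelihood exponent for a continuous power law above x_min."""
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def tail_probability(x, alpha, x_min):
    """P(X >= x) for a single event under the fitted power law."""
    return (x / x_min) ** (-(alpha - 1.0))


def bootstrap_tail_probability(data, x_min, threshold, n_boot, rng):
    """Rough 90% bootstrap interval for the per-event tail probability."""
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        alpha_b = fit_alpha(resample, x_min)
        estimates.append(tail_probability(threshold, alpha_b, x_min))
    estimates.sort()
    return estimates[int(0.05 * n_boot)], estimates[int(0.95 * n_boot) - 1]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = sample_power_law(rng, 2000, alpha=2.5, x_min=1.0)  # synthetic "event sizes"
alpha_hat = fit_alpha(data, 1.0)
low, high = bootstrap_tail_probability(data, 1.0, 100.0, 200, rng)
print(f"fitted alpha: {alpha_hat:.2f}")
print(f"P(single event >= 100): between {low:.2e} and {high:.2e}")
```

Even in this idealized setting the interval for the tail probability is wide relative to its size, which is the difficulty the abstract attributes to large fluctuations in the upper tail. Turning a per-event probability into "at least one such event in a given period" would additionally require the number of events in that period, which this sketch does not model.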
#5: Power outages
Power outages could have many causes. Note that insofar as we can forecast the phenomena underlying the causes, this can be used to reduce, rather than simply forecast, power outages.
- Poor load forecasting, i.e., electricity companies don't forecast how much demand there will be and don't prepare supplies adequately. This is less of an issue in developed countries, where the power systems are more redundant (at some cost to efficiency). Note here that the power outage occurs due to a failure of a more mundane forecasting exercise. Forecasting the frequency of power outages due to this cause is basically an exercise in calibrating the quality of the mundane forecasting exercise.
- Abrupt or significant shortages in fuel, often for geopolitical reasons. This therefore ties in with the general exercise of geopolitical forecasting (see my earlier post on the subject). This seems rare in the modern world, due to the considerable redundancy built into global fuel supplies.
- Disruption of power lines or power supply units due to weather events. The most common causes appear to be lightning, ice, wind, rain, and flooding. This ties in with #3, and with my weather forecasting and climate forecasting posts. This is the most common cause of power outages in developed countries with advanced electricity grids (see, for instance, here and here).
- Disruption by human or animal activity, including car accidents and animals climbing onto and playing with the power lines.
- Perhaps the most niche source of power outages, that many people may be unaware of, is geomagnetic storms (Wikipedia). These are quite rare, but can result in major power blackouts. Geomagnetic storms were discussed in past MIRI posts (here and here). Geomagnetic storms are low-frequency and low-probability events but with potentially severe negative impact.
My impression is that when it comes to power outages, we are at Category 2 in forecasting. Load forecasting can identify seasons, times of the day, and special occasions when power demand will be high. Note that the infrastructure needs to be built for peak capacity.
We can't quite be in Category 3, because in cases where we can forecast more finely, we could probably prevent the outage anyway.
What sort of preventive measures do people undertake with knowledge of the frequency of power outages? In places where power outages are more likely, people are more likely to have backup generators. People may be more likely to use battery-powered devices. If you know that a power outage is likely to happen in the next few days, you might take more care to charge the batteries on your devices.
#6: Server outages
In our increasingly connected world, websites going down can have a huge effect on the functioning of the Internet and of the world economy. As with power infrastructure, the complexity of server infrastructure needed to increase uptime increases very quickly. The point is that routing around failures at different points in the infrastructure requires redundancy. For instance, if any one server fails 10% of the time, and the failures of different components are independent, you'd need two servers to get to a 1% failure rate. But in practice, the failures aren't independent. For instance, having loads of servers in a single datacenter covers the risk of any given server there crashing, but it doesn't cover the risk of the datacenter itself getting disconnected (e.g., losing electricity, or getting disconnected from the Internet, or catching fire). So now we need multiple datacenters. But multiple datacenters are far from each other, so that increases the time costs of synchronization. And so on. For more detailed discussions of the issues, see here and here.
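The redundancy arithmetic above can be sketched in a few lines. This is only an illustration: the function names and the shared "datacenter" failure probability are assumptions introduced here, not something from the sources linked above. The first function assumes fully independent failures; the second adds one shared failure mode to show why extra servers in the same datacenter stop helping past a point.

```python
def servers_needed(p_single, target, rel_tol=1e-9):
    """Smallest number of servers whose joint failure rate is at most `target`.

    Assumes failures are independent, so all n servers are down together
    with probability p_single ** n. The small tolerance guards against
    floating-point noise (0.1 ** 2 is not exactly 0.01 in binary floats).
    """
    if not (0 < p_single < 1) or target <= 0:
        raise ValueError("need 0 < p_single < 1 and target > 0")
    n = 1
    while p_single ** n > target * (1 + rel_tol):
        n += 1
    return n


def failure_with_shared_risk(p_single, n_servers, p_shared):
    """Failure rate when all servers also share one common failure mode.

    p_shared is a hypothetical probability that the whole datacenter is
    unavailable (power loss, network cut, fire). The service is down if the
    datacenter is down, or if it is up and every server has failed.
    """
    return p_shared + (1 - p_shared) * p_single ** n_servers


# The example from the text: 10% per-server failure, 1% target.
print(servers_needed(0.1, 0.01))
# With a shared 0.1% datacenter risk, adding servers cannot push the
# failure rate below that 0.1% floor.
print(failure_with_shared_risk(0.1, 2, 0.001))
print(failure_with_shared_risk(0.1, 10, 0.001))
```

Under these assumptions, the shared risk acts as a floor on the achievable failure rate, which is the reason multiple datacenters enter the picture.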
My impression is that server outages are largely Category 1: we can use the probability of outages to determine the trade-off between the cost of having redundant infrastructure and the benefit of more uptime. There is an element of Category 2: in some cases, we have knowledge that traffic will be higher at specific times, and additional infrastructure can be brought to bear for those times. As with power infrastructure, server infrastructure needs to be built to handle peak capacity.
#7: Financial crises
The forecasting of financial crises is a topic worthy of its own post. As with climate science, financial crisis forecasting has the potential for heavy politicization, given the huge stakes both of forecasting financial crises and of any remedial or preventative measures that may be undertaken. In fact, the politicization and ideology problem is probably substantially worse in financial crisis forecasting. At the same time, real-world feedback occurs faster, providing more opportunity for people to update their beliefs and less scope for people getting away with sloppiness because their predictions take too long to evaluate.
A literally taken strong efficient market hypothesis (EMH) (Wikipedia) would suggest that financial crises are almost impossible to forecast, while a weaker reading of the EMH would suggest that the financial market is efficient (Wikipedia): it's hard to make money off the business of forecasting financial crises (for instance, you may know that a financial crisis is imminent with high probability, but the element of uncertainty, particularly with regard to timing, can destroy your ability to leverage that information to make money). On the other hand, there are a lot of people, often subscribed to competing schools of economic thought, who successfully forecast the 2007-08 financial crisis, at least in broad strokes.
Note that there are people who reject the EMH, yet claim that financial crises are very hard to forecast in a time-dependent fashion. Among them is Nassim Nicholas Taleb, as described here. Interestingly, Taleb's claim to fame appears to have been that he was able to forecast the 2007-08 financial crisis, although it was more of a time-independent forecast than a specific timed call. The irony was noted by Jamie Whyte here in Standpoint Magazine.
Economic Predictions records predictions made by many prominent people and how they compared to what transpired. In particular, this page on their website notes how many of the top investors, economists, and bureaucrats missed the financial crisis, but also identifies some exceptions: Dean Baker, Med Jones, Peter Schiff, and Nouriel Roubini. The page also discusses other candidates who claim to have forecasted the crisis in advance, and reasons why they were not included. While I think they've put in a fair deal of effort into their project, I didn't see good evidence that they have a strong grasp of the underlying fundamental issues they are discussing.
An insightful general overview of the financial crisis is found in Chapter 1 of Nate Silver's The Signal and the Noise, a book that I recommend you read in its entirety. Silver notes four levels of forecasting failure.
- The housing bubble can be thought of as a poor prediction. Homeowners and investors thought that rising prices implied that home values would continue to rise, when in fact history suggested this made them prone to decline.
- There was a failure on the part of the ratings agencies, as well as by banks like Lehman Brothers, to understand how risky mortgage-backed securities were. Contrary to the assertions they made before Congress, the problem was not that the ratings agencies failed to see the housing bubble. Instead, their forecasting models were full of faulty assumptions and false confidence about the risk that a collapse in housing prices might present.
- There was a widespread failure to anticipate how a housing crisis could trigger a global financial crisis. It had resulted from the high degree of leverage in the market, with $50 in side bets staked on every $1 that an American was willing to invest in a new home.
- Finally, in the immediate aftermath of the financial crisis, there was a failure to predict the scope of the economic problems that it might create. Economists and policy makers did not heed Reinhart and Rogoff’s finding that financial crises typically produce very deep and long-lasting recessions.
Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 42-43). Penguin Group US. Kindle Edition.
Silver finds a common thread among all the failures (emphases in original):
There is a common thread among these failures of prediction. In each case, as people evaluated the data, they ignored a key piece of context:
- The confidence that homeowners had about housing prices may have stemmed from the fact that there had not been a substantial decline in U.S. housing prices in the recent past. However, there had never before been such a widespread increase in U.S. housing prices like the one that preceded the collapse.
- The confidence that the banks had in Moody’s and S&P’s ability to rate mortgage-backed securities may have been based on the fact that the agencies had generally performed competently in rating other types of financial assets. However, the ratings agencies had never before rated securities as novel and complex as credit default options.
- The confidence that economists had in the ability of the financial system to withstand a housing crisis may have arisen because housing price fluctuations had generally not had large effects on the financial system in the past. However, the financial system had probably never been so highly leveraged, and it had certainly never made so many side bets on housing before.
- The confidence that policy makers had in the ability of the economy to recuperate quickly from the financial crisis may have come from their experience of recent recessions, most of which had been associated with rapid, “V-shaped” recoveries. However, those recessions had not been associated with financial crises, and financial crises are different.
There is a technical term for this type of problem: the events these forecasters were considering were out of sample. When there is a major failure of prediction, this problem usually has its fingerprints all over the crime scene.
Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (p. 43). Penguin Group US. Kindle Edition.
While I find Silver's analysis plausible and generally convincing, I don't think I have enough of an inside-view understanding of the issue.
A few other resources that I found, but didn't get a chance to investigate, are listed below:
- Forecasting Financial Crisis website includes many publications related to financial crisis forecasting and prevention.
- How to forecast a financial crisis (article in Financial Times).
- Forecasting the Global Financial Crisis, a working paper by Daniela Bragoli.
#8: Pandemics
I haven't investigated this thoroughly, but here are a few of my impressions and findings:
- I think that pandemics stand in relation to ordinary epidemiology in the same way that extreme weather events stand in relation to ordinary weather forecasting. In both cases, the main way we can get better at forecasting the rare and high-impact events is by getting better across the board. There is a difference that makes the relation between moderate disease outbreaks and pandemics even more important than the corresponding case for weather: measures taken quickly to react to local disease outbreaks can help prevent global pandemics.
- Chapter 7 of Nate Silver's The Signal and the Noise, titled "Role Models", discusses forecasting and prediction in the domain of epidemiology. The goal of epidemiologists is to obtain predictive models that have a level of accuracy and precision similar to those used for the weather. However, the greater complexity of human behavior, as well as the self-fulfilling and self-canceling nature of various predictions, makes the modeling problem harder. Silver notes that agent-based modeling (Wikipedia) is one of the commonly used tools. Silver cites a few examples from recent history where people were overly alarmed about possible pandemics, when the reality turned out to be considerably milder. However, the precautions taken due to the alarm may still have saved lives. Silver talks in particular of the 1976 swine flu outbreak (where the reaction turned out to be grossly disproportionate to the problem, and caused its own unintended consequences) and the 2009 flu pandemic.
- In recent years, Google Flu Trends (website, Wikipedia) has been a commonly used tool for identifying and taking quick action against the flu. Essentially, Google uses the volume of web searches for flu-related terms by geographic location to estimate the incidence of the flu by geographic location. It offers an early "leading indicator" of flu incidence compared to official reports, which are published after a time lag. However, Google Flu Trends has run into problems of reliability: news stories about the flu might prompt people to search for flu-related terms, even if they aren't experiencing symptoms of the flu. Or it may even be the case that Google's own helpful search query completions get people to search for flu-related terms once other people start searching for the term. Tim Harford discusses the problems in the Financial Times here. I think Silver doesn't discuss this (which is a surprise, since it would have fit well with the theme of his chapter).
#9: Near-earth object impacts
I haven't looked into this category in sufficient detail. I'll list below the articles I read.
- Wikipedia pages on asteroid-impact avoidance, near-earth object, B612 Foundation, Minor Planet Center, NEOShield, Sentry (monitoring system), Spaceguard, and Chelyabinsk meteor.
- GiveWell's shallow overview of asteroid detection.
- The hazard of near-earth asteroid impacts (PDF) by Clark R. Chapman (gated copy), Earth and Planetary Science Letters, Volume 222, Issue 1, 15 May 2004, Pages 1–15.
- A 500-kiloton airburst over Chelyabinsk and an enhanced hazard from small impactors, published in Nature after the Chelyabinsk meteor.
Adoption and twin studies are very important for determining the impact of genes versus environment in the modern world (and hence the likely impact of various interventions). Other types of studies tend to show larger effects for some of these interventions, but these studies are seen as dubious, as they may fail to adjust for various confounders (e.g., families with more books also have more educated parents).
But adoption studies have their own confounders. The biggest one is that in many countries, the genetic parents have a role in choosing the adoptive parents. Add the fact that adoptive parents also choose their adopted children, and that various social workers and others have great influence over the process, and this would seem to be a huge confounder interfering with the results.
This paper also mentions a confounder for some types of twin studies, such as identical versus fraternal twins. They point out that identical twins in the same family will typically get a much greater shared environment than fraternal twins, because people will treat them much more similarly. This is to my mind quite a weak point, but it is an issue nonetheless.
Since I have very little expertise in these areas, I was just wondering if anyone knew about efforts to estimate the impact of these confounders and adjust for them.
- Houston, TX: 12 July 2014 02:00PM
- Upper Canada LW Megameetup: Ottawa, Toronto, Montreal, Waterloo, London: 18 July 2014 07:00PM
- Moscow meet up: 06 July 2014 02:00PM
The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:
- Canberra: Paranoid Debating: 12 July 2014 06:00PM
- [Melbourne] July Rationality Dojo: Disagreement: 06 July 2014 03:00PM
- Sydney Rationality Dojo - Agency: 06 July 2014 04:00PM
Locations with regularly scheduled meetups: Austin, Berkeley, Berlin, Boston, Brussels, Buffalo, Cambridge UK, Canberra, Columbus, London, Madison WI, Melbourne, Mountain View, New York, Philadelphia, Research Triangle NC, Salt Lake City, Seattle, Sydney, Toronto, Vienna, Washington DC, Waterloo, and West Los Angeles. There's also a 24/7 online study hall for coworking LWers.
Of the technologies that have a reasonable chance of coming to mass market in the next 20-25 years and having a significant impact on human society, driverless cars (also known as self-driving cars or autonomous cars) stand out. I was originally planning to collect material discussing driverless cars, but Gwern has a really excellent compendium of statements about driverless cars, published January 2013 (if you're reading this, Gwern, thanks!). There have been a few developments since then (for instance, Google's announcement that it was building its own driverless car, or a startup called Cruise Automation planning to build a $10,000 driverless car) but the overall landscape remains similar. There's been some progress with understanding and navigating city streets and with handling adverse weather conditions, and it's more or less on schedule.
My question is about driverless car forecasts. Driverless Future has a good summary page of forecasts made by automobile manufacturers, insurers, and professional societies. The range of time for the arrival of the first commercial driverless cars varies between 2018 and 2030. The timeline for driverless cars to achieve mass penetration is similarly staggered between the early 2020s and 2040. (The forecasts aren't all directly comparable.)
A few thoughts come to mind:
- Insurers and professional societies seem more conservative in their estimates than manufacturers (both automobile manufacturers and people manufacturing the technology for driverless cars). Note that the estimates of many manufacturers are centered on their projected release dates for their own driverless cars. This suggests an obvious conflict of interest: manufacturers may be incentivized to be optimistic in their projections of when driverless cars will be released, insofar as making more optimistic predictions wins them news coverage and might also improve their market valuation. (At the same time, the release dates are sufficiently far in the future that it's unlikely that they'll be held to account for false projections, so there isn't a strong incentive to be conservative the same way as there is with quarterly sales and earnings forecasts.) Overall, then, I'd defer more to the judgment of the professional societies, namely the IEEE and the Society of Automotive Engineers.
- The statements compiled by Gwern point to the many legal hurdles and other thorny issues of ethics that would need to be resolved, at least partially, before driverless cars start becoming a big presence in the market.
- The general critique made by Schnaars in Megamistakes (that I discussed here) applies to driverless car technology: consumers may be unwilling to pay the added cost despite the safety benefits. Some of the quotes in Gwern's compendium reference related issues. This points further in the direction of forecasts by manufacturers being overly optimistic.
Questions for the people here:
- Do you agree with my points (1)-(3) above?
- Would you care to make forecasts for things such as: (a) the date that the first commercial driverless car will hit the market in a major country or US state, (b) the date by which over 10% of new cars sold in a large country or US state will be driverless (i.e., capable of fully autonomous operation), (c) same as (b), but over 50%, (d) the date by which over 10% of cars on the road (in a large country or US state) will be operating autonomously, (e) same as (d), but over 50%? You don't have to answer these exact questions; I'm just providing some suggestions since "forecast the future of driverless cars" is overly vague.
- What's your overall view on whether it is desirable at the margin to speed up or slow down the arrival of autonomous vehicles on the road? What factors would you consider in answering such a question?
Vincent Müller and Nick Bostrom have just released a paper surveying the results of a poll of experts about future progress in artificial intelligence. The authors have also put up a companion site where visitors can take the poll and see the raw data. I just checked the site and so far only one individual has submitted a response. This provides an opportunity for testing the views of LW members against those of experts. So if you are willing to complete the questionnaire, please do so before reading the paper. (I have abstained from providing a link to the pdf to create a trivial inconvenience for those who cannot resist temptation. Once you take the poll, you can easily find the paper by conducting a Google search with the keywords: bostrom muller future progress artificial intelligence.)
Note: This post is part of my series on forecasting for MIRI. I recommend reading my earlier post on the general-purpose forecasting community, my post on scenario planning, and my post on futures studies. Although this post doesn't rely on those, they do complement each other.
Note 2: If I run across more domains where I have substantive things to say, I'll add them to this post (if I've got a lot to say, I'll write a separate post and add a link to it as well). Suggestions for other domains worth looking into, that I've missed below, would be appreciated.
Below, I list some examples of domains where forecasting is commonly used. In the post, I briefly describe each of the domains, linking to other posts of mine, or external sources, for more information. The list is not intended to be comprehensive. It's just the domains that I investigated at least somewhat and therefore have something to write about.
- Weather and climate forecasting
- Agriculture, crop simulation
- Business forecasting, including demand, supply, and price forecasting
- Macroeconomic forecasting
- Political and geopolitical forecasting: This includes forecasting of election results, public opinion on issues, armed conflicts or political violence, and legislative changes
- Demographic forecasting, including forecasting of population, age structure, births, deaths, and migration flows.
- Energy use forecasting (demand forecasting, price forecasting, and supply forecasting, including forecasting of conventional and alternative energy sources; borrows some general ideas from business forecasting)
- Technology forecasting
Let's look into these in somewhat more detail.
Note that for some domains, scenario planning may be more commonly used than forecasting in the traditional sense. Some domains have historically been more closely associated with machine learning, data science, and predictive analytics techniques (this is usually the case when a large number of explanatory variables are available). Some domains have been more closely associated with futures studies, that I discussed here. I've included the relevant observations for individual domains where applicable.
Climate and weather forecasting
- The best weather forecasting methods use physical models rather than statistical models (though some statistics/probability is used to tackle some inherently uncertain processes, such as cloud formation). Moreover, they use simulations rather than direct closed form expressions. Errors compound over time due to a combination of model errors, measurement errors, and hypersensitivity to initial conditions.
- There are two baseline models against which the quality of any model can be judged: persistence (weather tomorrow is predicted to be the same as weather today) and climatology (weather tomorrow is predicted to be the average of the weather on that day over the last few years). We can think of persistence and climatology as purely statistical approaches, and these already do quite well. Any approach that consistently beats them needs to run very computationally intensive weather simulations.
- Even though a lot of computing power is used in weather prediction, human judgment still adds considerable value, about 10-25%, relative to what the computer models generate. This is attributed to humans being better able to integrate historical experience and common sense into their forecasts, and to offer better sanity checks. The use of machine learning tools in sanity-checking weather forecasts might substitute for the human value-added.
- Long-run climate forecasting methods are more robust in the sense of not being hypersensitive to initial conditions. Long-run forecasts require a better understanding of the speed and strength of various feedback mechanisms and equilibrating processes, and this makes them more uncertain. Whereas the uncertainty in short-run forecasts is mostly initial condition uncertainty, the uncertainty in long run forecasts arises from scenario uncertainty, plus uncertainty about the strength of various feedback mechanisms.
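The two baselines described in the list above, persistence and climatology, are simple enough to write down directly. The sketch below is a minimal illustration with made-up temperatures; the function names and the data are assumptions introduced here, and the error measure (mean absolute error) is just one reasonable choice for judging a model against the baselines.

```python
from statistics import mean


def persistence_forecast(history):
    """Predict tomorrow's value as today's value (the last observation)."""
    return history[-1]


def climatology_forecast(same_day_past_years):
    """Predict tomorrow's value as the average for that calendar day in past years."""
    return mean(same_day_past_years)


def mean_absolute_error(forecasts, actuals):
    """Average size of the forecast errors, ignoring their sign."""
    return mean([abs(f - a) for f, a in zip(forecasts, actuals)])


# Hypothetical daily high temperatures (degrees C) for the past three days.
recent_days = [18.0, 20.0, 21.5]
# Hypothetical highs on tomorrow's calendar date in three earlier years.
same_date_earlier_years = [19.0, 21.0, 23.0]

print(persistence_forecast(recent_days))
print(climatology_forecast(same_date_earlier_years))
```

A candidate model would be scored with `mean_absolute_error` over many days and compared against the scores of both baselines over the same days.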
With long-term climate forecasting, a common alternative to forecasting is scenario analysis, such as that used by the IPCC in its discussion of long-term climate change. An example is the IPCC Special Report on Emissions Scenarios.
In addition to my overviews of weather and climate forecasting, I also wrote a series of posts on climate change science and some of its implications. These provide some interesting insight into the different points of contention related to making long-term climate forecasts, identifying causes, and making sense of a somewhat politicized realm of discourse. My posts in the area so far are below (I'll update this list with more posts as and when I make them):
- Climate science: how it matters for understanding forecasting, materials I've read or plan to read, sources of potential bias
- Time series forecasting for global temperature: an outside view of climate forecasting
- Carbon dioxide, climate sensitivity, feedback, and the historical record: a cursory examination of the Anthropogenic Global Warming (AGW) hypothesis
- [QUESTION]: What are your views on climate change, and how did you form them?
- The insularity critique of climate science
Agriculture and crop simulation
- Predictions of agricultural conditions and crop yields are made using crop simulation models (Wikipedia, PDF overview). Crop simulation models include purely statistical models, physical models that rely on simulations, and approximate physical models that use functional expressions.
- Weather and climate predictions are a key component of agricultural prediction, because of the dependence of agricultural yield on climate conditions. Some companies, such as The Climate Corporation (website, Wikipedia) specialize in using climate prediction to make predictions and recommendations for farmers.
Business forecasting
- Business forecasting includes forecasting of demand, supply, and price.
- Time series forecasting (i.e., trying to predict future values of a variable from past values of that variable alone) is quite common for businesses operating in environments where they have very little understanding of or ability to identify and measure explanatory variables.
- As with weather forecasting, persistence (or slightly modified versions thereof, such as trend persistence that assumes a constant rate of growth) can generally be simple to implement while coming close to the theoretical limit of what can be predicted.
- More about business forecasting can be learned from the SAS Business Forecasting Blog or the Institute of Business Forecasting and Planning website and LinkedIn group.
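As a rough illustration of the persistence variants mentioned in the list above, here is a minimal sketch of level persistence and trend persistence for a single time series. The function names and the sales figures are hypothetical, and taking the growth rate from only the last two observations is a simplifying assumption; in practice one might average the growth rate over a longer window.

```python
def level_persistence(series, steps=1):
    """Forecast every future period as equal to the last observed value."""
    return [series[-1]] * steps


def trend_persistence(series, steps=1):
    """Forecast by assuming the most recent growth rate continues unchanged.

    The growth rate is taken from the last two observations only, which is
    an assumption made here for simplicity.
    """
    growth = series[-1] / series[-2]
    return [series[-1] * growth ** k for k in range(1, steps + 1)]


# Hypothetical quarterly unit sales that doubled in the latest quarter.
sales = [100.0, 200.0]
print(level_persistence(sales, steps=2))
print(trend_persistence(sales, steps=2))
```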
Two commonly used journals in business forecasting are:
- Journal of Business Forecasting (website)
- International Journal of Business Forecasting and Marketing Intelligence (website)
Many of the time series used in the Makridakis Competitions (that I discussed in my review of historical evaluations of forecasting) come from businesses, so the lessons of that competition can broadly be said to apply to the realm of business forecasting (the competition also uses a few macroeconomic time series).
Macroeconomic forecasting
There is a mix of explicit forecasting models and individual judgment-based forecasters in the macroeconomic forecasting arena. However, unlike the case of weather forecasting, where the explicit forecasting models (or more precisely, the numerical weather simulations) improve forecast accuracy to a level that would be impossible for unaided humans, the situation with macroeconomic forecasting is more ambiguous. In fact, the most reliable macroeconomic forecasts seem to arise by taking averages of the forecasts of a reasonably large number of expert forecasters, each using their own intuition, judgment, or formal model. For an overview of the different examples of survey-based macroeconomic forecasting and how they compare with each other, see my earlier post on the track record of survey-based macroeconomic forecasting.
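One reason averaging helps can be seen with simple arithmetic: the absolute error of the average forecast can never exceed the average of the individual absolute errors, because errors in opposite directions partly cancel. The sketch below uses made-up numbers (the forecasts and the realized value are hypothetical, as are the function names) and is not a reconstruction of how any actual survey aggregates its responses.

```python
from statistics import mean


def consensus_forecast(forecasts):
    """Simple unweighted average of the individual forecasts."""
    return mean(forecasts)


def absolute_error(forecast, actual):
    """Size of the forecast error, ignoring its sign."""
    return abs(forecast - actual)


# Hypothetical GDP growth forecasts (percent) from five forecasters.
forecasts = [2.1, 2.8, 3.4, 1.9, 2.6]
# Hypothetical realized growth.
actual = 2.5

consensus_error = absolute_error(consensus_forecast(forecasts), actual)
average_individual_error = mean([absolute_error(f, actual) for f in forecasts])

print(consensus_error)
print(average_individual_error)
```

Note what this does and does not show: the consensus is guaranteed to do at least as well as the typical forecaster, but not necessarily as well as the best one.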
Political and geopolitical forecasting
I reviewed political and geopolitical forecasting, including forecasting for political conflicts and violence, in this post. A few key highlights:
- This is the domain where Tetlock did his famous work showing that experts don't do a great job of predicting things, as described in his book Expert Political Judgment. I discussed Tetlock's work briefly in my review of historical evaluations of forecasting.
- Currently, the most reliable source of forecasts for international political questions is The Good Judgment Project (website, Wikipedia), which relies on aggregating the judgments of contestants who are given access to basic data and are allowed to use web searches. The GJP is co-run by Tetlock. For election forecasting in the United States, PollyVote (website, Wikipedia), FiveThirtyEight (website, Wikipedia), and prediction markets such as Intrade (website, Wikipedia) and the Iowa Electronic Markets (website, Wikipedia) are good forecast sources. Of these, PollyVote appears to have done the best, but the others have been more widely used.
- Quantitative approaches to prediction rely on machine learning and data science, combined with text analysis of news of political events.
Demographic forecasting
Forecasting of future population is a tricky business, but some aspects are easier to forecast than others. For instance, the population of 25-year-olds 5 years from now can be determined with reasonable precision by knowing the population of 20-year-olds now. Other variables, such as birth rates, are harder to predict (they can go up or down fast, at least in principle) but in practice, assuming level persistence or trend persistence can often offer reasonably good forecasts over the short term. While there are long-run trends (such as a trend of decline in both period fertility and total fertility) I don't know how well these declines were predicted. I wrote up some of my findings on the recent phenomenon of ultra-low fertility in many countries, so I have some knowledge of fertility trends, but I did not look systematically into the question of whether people were able to correctly forecast specific trends.
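The 20-year-olds example can be made concrete with a small cohort projection. This is a minimal sketch under stated assumptions: the function name, the constant annual survival rate, and the constant net migration figure are all hypothetical inputs chosen for illustration, not values or methods taken from any of the sources mentioned here.

```python
def project_cohort(current_population, annual_survival_rate, years,
                   net_migration_per_year=0.0):
    """Age a single birth cohort forward by `years`.

    Each year the cohort is multiplied by an assumed survival rate and then
    adjusted by an assumed constant net migration (positive for inflow).
    """
    population = float(current_population)
    for _ in range(years):
        population = population * annual_survival_rate + net_migration_per_year
    return population


# Hypothetical: 1,000,000 20-year-olds today, 99.9% annual survival,
# and a net inflow of 2,000 people of that cohort per year.
print(project_cohort(1_000_000, 0.999, 5, net_migration_per_year=2_000))
```

The forecast is precise to the extent that survival and migration for this age group are stable, which is why such cohorts are easier to project than future births.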
Gary King (Wikipedia) has written a book on demographic forecasting and also prepared slides covering the subject. I skimmed through his writing, but not enough to comment on it. It seems like mostly simple mathematics and statistics, tailored somewhat to the context of demographics.
With demographics, depending on context, scenario analyses may be more useful than forecasts. For instance, land use planning or city development may be done keeping in mind different possibilities for how the population and age structure might change.
Energy use forecasting (demand and supply)
Short-term energy use forecasting is often treated as a data science or predictive modeling problem, though ideas from general-purpose forecasting also apply. You can get an idea of the state of energy use forecasting by checking out the Global Energy Forecasting Competition (website, Wikipedia), carried out by a team led by Dr. Tao Hong, and cooperating with data science competitions company Kaggle (website, Wikipedia), some of the IEEE working groups, and the International Journal of Forecasting (one of the main journals of the forecasting community).
For somewhat more long-term energy forecasting, scenario analyses are more common. Energy is so intertwined with the global economy that an analysis of long-term energy use often involves thinking about many other elements of the world.
Shell (the organization to pioneer scenario analysis for the private sector) publishes some of its scenario analyses online at the Future Energy Scenarios page. While the understanding of future energy demand and supply is a driving force for the scenario analyses, they cover a wide range of aspects of society. For instance, the New Lens Scenario published in 2012 described two candidate futures for how the world might unfold till 2100, a "Mountains" future where governments played a major role and coordinated to solve global crises, and an "Oceans" future that was more decentralized and market-driven. (For a critique of Shell's scenario planning, see here). Shell competitor BP also publishes an Energy Outlook that is structured more as a forecast than as a scenario analysis, but does briefly consider alternative assumptions in a fashion similar to scenario analysis.
Many people in the LessWrong audience might find technology forecasting to be the first thing that crosses their minds when the topic of forecasting is raised. This is partly because technology improvements are quite salient. Improvements in computing are closely linked with the possibility of an Artificial General Intelligence. Famous among the people who view technology trends as harbingers of superintelligence is technologist and inventor Ray Kurzweil, who has been evaluated on LessWrong before. Websites such as KurzweilAI.net and Exponential Times have popularized the idea of rapid, unprecedented, exponential growth that, despite its fast pace, is somewhat predictable because of its close-to-exponential pattern.
One other point about technology forecasting: compared to other types of forecasting, technology forecasting is more intricately linked with the domain of futures studies (that I described here). Why technology forecasting specifically? Futures studies seems designed more for studying and bringing about change rather than determining what will happen at or by a specific time. Technology forecasting, unlike other forms of forecasting, is forecasting changes in the technology that we use to operate our lives. So this is the most transformative forecasting domain, and naturally attracts more attention from futures studies.
Note: Please see this post of mine for more on the project, my sources, and potential sources for bias.
One of the categories of critique that have been leveled against climate science is the critique of insularity. Broadly, it is claimed that the type of work that climate scientists are trying to do draws upon insight and expertise in many other domains, but climate scientists have historically failed to consult experts in those domains or even to follow well-documented best practices.
Note: I wrote a preliminary version of this before drafting the post, but after having done most of the relevant investigation. I reviewed and edited it prior to publication. Note also that I don't justify these takeaways explicitly in my later discussion, because a lot of these come from general intuitions of mine and it's hard to articulate how the information I received explicitly affected my reaching the takeaways. I might discuss the rationales behind these takeaways more in a later post.
- Many of the criticisms are broadly on the mark: climate scientists should have consulted best practices in other domains, and in general should have either followed them or clearly explained the reasons for divergence.
- However, this criticism is not unique to climate science: academia in general has suffered from problems of disciplines being relatively insular (UPDATE: Here's Robin Hanson saying something similar). And many similar things may be true, albeit in different ways, outside academia.
- One interesting possibility is that bad practices here operate via founder effects: for an area that starts off as relatively obscure and unimportant, setting up good practices may not be considered important. But as the area grows in importance, it is quite rare for the area to be cleaned up. People and institutions get used to the old ways of doing things. They have too much at stake to make reforms. This does suggest that it's important to get things right early on.
- (This is speculative, and not discussed in the post): The extent of insularity of a discipline seems to be an area where a few researchers can have significant effect on the discipline. If a few reasonably influential climate scientists had pushed for more integration with and understanding of ideas from other disciplines, the history of climate science research would have been different.
Relevant domains they may have failed to use or learn from
- Forecasting research: Although climate scientists were engaging in an exercise that had a lot to do with forecasting, they neither cited research nor consulted experts in the domain of forecasting.
- Statistics: Climate scientists used plenty of statistics in their analysis. They did follow the basic principles of statistics, but in many cases used them incorrectly or combined them with novel approaches that were nonstandard and did not have clear statistical literature justifying the use of such approaches.
- Programming and software engineering: Climate scientists used a lot of code both for their climate models and for their analyses of historical climate. But their code failed basic principles of decent programming, let alone good software engineering principles such as documentation, unit testing, consistent variable names, and version control.
- Publication of data, metadata, and code: This practice is becoming increasingly common in some other sectors of academia and industry. Climate scientists failed to learn from econometrics and biomedical research, fields that had been struggling with some qualitatively similar problems and that had been moving to publishing data, metadata, and code.
Let's look at each of these critiques in turn.
Critique #1: Failure to consider forecasting research
We'll devote more attention to this critique, because it has been made, and addressed, cogently in considerable detail.
J. Scott Armstrong (faculty page, Wikipedia) is one of the big names in forecasting. In 2007, Armstrong and Kesten C. Green co-authored a global warming audit (PDF of paper, webpage with supporting materials) for the Forecasting Principles website that was critical of the forecasting exercises by climate scientists used in the IPCC reports.
Armstrong and Green began their critique by noting the following:
- The climate science literature did not reference any of the forecasting literature, and there was no indication that they had consulted forecasting experts, even though what they were doing was to quite an extent a forecasting exercise.
- There was only one paper, by Stewart and Glantz, dating back to 1985, that could be described as a forecasting audit, and that paper was critical of the methodology of climate forecasting. That paper appears to have been cited very little in the years that followed.
- Armstrong and Green tried to contact leading climate scientists. Of the few who responded, none listed specific forecasting principles they followed, or reasons for not following general forecasting principles. They pointed to the IPCC reports as the best source for forecasts. Armstrong and Green estimated that the IPCC report violated 72 of 89 forecasting principles they were able to rate (their list of forecasting principles includes 140 principles, but they judged only 127 as applicable to climate forecasting, and were able to rate only 89 of them). No climate scientists responded to their invitation to provide their own ratings for the forecasting principles.
How significant are these general criticisms? It depends on the answers to the following questions:
- In general, how much credence do you assign to the research on forecasting principles, and how strong a prior do you have in favor of these principles being applicable to a specific domain? I think the answer is that forecasting principles as identified on the Forecasting Principles website are a reasonable starting point, and therefore, any major forecasting exercise (or exercise that implicitly generates forecasts) should at any rate justify major points of departure from these principles.
- How representative are the views of Armstrong and Green in the forecasting community? I have no idea about the representativeness of their specific views, but Armstrong in particular is high-status in the forecasting community (that I described a while back) and the Forecasting Principles website is one of the go-to sources, so material on the website is probably not too far from views in the forecasting community. (Note: I asked the question on Quora a while back, but haven't received any answers).
So it seems like there was arguably a failure of proper procedure in the climate science community in terms of consulting and applying practices from relevant domains. Still, how germane was it to the quality of their conclusions? Maybe it didn't matter after all?
In Chapter 12 of The Signal and the Noise, statistician and forecaster Nate Silver offers the following summary of Armstrong and Green's views:
- First, Armstrong and Green contend that agreement among forecasters is not related to accuracy—and may reflect bias as much as anything else. “You don’t vote,” Armstrong told me. “That’s not the way science progresses.”
- Next, they say the complexity of the global warming problem makes forecasting a fool’s errand. “There’s been no case in history where we’ve had a complex thing with lots of variables and lots of uncertainty, where people have been able to make econometric models or any complex models work,” Armstrong told me. “The more complex you make the model the worse the forecast gets.”
- Finally, Armstrong and Green write that the forecasts do not adequately account for the uncertainty intrinsic to the global warming problem. In other words, they are potentially overconfident.
Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (p. 382). Penguin Group US. Kindle Edition.
Silver addresses each of these in his book (read it to know what he says). Here are my own thoughts on the three points as put forth by Silver:
- I think consensus among experts (to the extent that it does exist) should be taken as a positive signal, even if the experts aren't good at forecasting. But certainly, the lack of interest or success in forecasting should dampen the magnitude of the positive signal. We should consider it likely that climate scientists have identified important potential phenomena, but should be skeptical of any actual forecasts derived from their work.
- I disagree somewhat with this point. I think forecasting could still be possible, but as of now, there is little of a successful track record of forecasting (as Green notes in a later draft paper). So forecasting efforts, including simple ones (such as persistence, linear regression, random walk with drift) and ones based on climate models (both the ones in common use right now and others that give more weight to the PDO/AMO), should continue but the jury is still out on the extent to which they work.
- I agree here that many forecasters are potentially overconfident.
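For readers unfamiliar with the simple methods named in the second point (persistence, random walk with drift, linear regression), here is a minimal sketch of each, together with a hold-out comparison. It is an illustration under my own assumptions, not a reproduction of anything in Armstrong and Green's work or in the IPCC reports. The function names are hypothetical, and the series at the bottom is a toy series rather than climate data.

```python
def no_change(history, horizon):
    """Persistence: every future value equals the last observed value."""
    return history[-1]


def random_walk_with_drift(history, horizon):
    """Last value plus the average historical step, repeated `horizon` times."""
    drift = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + horizon * drift


def linear_trend(history, horizon):
    """Ordinary least squares line through (0, y0), (1, y1), ..., extrapolated."""
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept + slope * (n - 1 + horizon)


def mean_absolute_errors(series, n_train, methods):
    """Fit on the first n_train points, score on the rest."""
    history, future = series[:n_train], series[n_train:]
    return {
        name: sum(abs(f(history, h) - actual)
                  for h, actual in enumerate(future, start=1)) / len(future)
        for name, f in methods.items()
    }


methods = {
    "no change": no_change,
    "random walk with drift": random_walk_with_drift,
    "linear trend": linear_trend,
}
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # toy data, not observations
print(mean_absolute_errors(toy_series, n_train=4, methods=methods))
```

On a perfectly linear toy series the two trend-following methods win by construction. The interesting question for real data is whether more elaborate models beat these benchmarks out of sample.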
Some counterpoints to the Armstrong and Green critique:
- One can argue that what climate scientists are doing isn't forecasting at all, but scenario analysis. After all, the IPCC generates scenarios, but not forecasts. But as I discussed in an earlier post, scenario planning and forecasting are closely related, and even if scenarios aren't direct explicit unconditional forecasts, they often involve implicit conditional forecasts. To its credit, the IPCC does seem to have used some best practices from the scenario planning literature in generating its emissions scenarios. But that is not part of the climate modeling exercise of the IPCC.
- Many other domains that involve planning for the future don't reference the forecasting literature. Examples include scenario planning (discussed here) and the related field of futures studies (discussed here). Insularity of disciplines from each other is a common feature (or bug) in much of academia. Can we really expect or demand that climate scientists hold themselves to a higher standard?
UPDATE: I forgot to mention in my original draft of the post that Armstrong challenged Al Gore to a bet pitting Armstrong's No Change model against the IPCC model. Gore did not accept the bet, but Armstrong created the website (here) anyway to record the relative performance of the two models.
UPDATE 2: Read drnickbone's comment and my replies for more information on the debate. drnickbone in particular points to responses from Real Climate and Skeptical Science, that I discuss in my response to his comment.
Critique #2: Inappropriate or misguided use of statistics, and failure to consult statisticians
To some extent, this overlaps with Critique #1, because best practices in forecasting include good use of statistical methods. However, the critique is a little broader. There are many parts of climate science not directly involved with forecasting, but where statistical methods still matter. Historical climate reconstruction is one such example. The purpose of these is to get a better understanding of the sorts of climate that could occur and have occurred, and how different aspects of the climate correlated. Unfortunately, historical climate data is not very reliable. How do we deal with different proxies for the climate variables we are interested in so that we can reconstruct them? A careful use of statistics is important here.
Let's consider an example that's quite far removed from climate forecasting, but has (perhaps undeservedly) played an important role in the public debate on global warming: Michael Mann's famed hockey stick (Wikipedia), discussed in detail in Mann, Bradley and Hughes (henceforth, MBH98) (available online here). The major critiques of the paper arose in a series of papers by McIntyre and McKitrick, the most important of them being their 2005 paper in Geophysical Research Letters (henceforth, MM05) (available online here).
I read about the controversy in the book The Hockey Stick Illusion by Andrew Montford (Amazon, Wikipedia), but the author also has a shorter article titled Caspar and the Jesus paper that covers the story as it unfolds from his perspective. While there's a lot more to the hockey stick controversy than statistics alone, some of the main issues are statistical.
Unfortunately, I wasn't able to resolve the statistical issues myself well enough to have an informed view. But my very crude intuition, as well as the statements made by statisticians as recorded below, supports Montford's broad outline of the story. I'll try to describe the broad critiques leveled from the statistical perspective:
- Choice of centering and standardization: The data was centered around the 20th-century mean rather than the mean of the whole series, a method known as short-centering, which is bound to create a bias in favor of picking hockey stick-like shapes when doing principal components analysis. Each series was also standardized (divided by the standard deviation for the 20th century), which McIntyre argued was inappropriate.
- Unusual choice of statistic used for significance: MBH98 used a statistic called the RE statistic (reduction of error statistic). This is a fairly unusual statistic to use. In fact, it doesn't have a Wikipedia page, and practically the only stuff on the web (on Google and Google Scholar) about it was in relation to tree-ring research (the proxies used in MBH98 were tree rings). This should seem suspicious: why is tree-ring research using a statistic that's basically unused outside the field? There are good reasons to avoid using statistical constructs on which there is little statistical literature, because people don't have a feel for how they work. MBH98 could have used the R^2 statistic instead, and in fact, they mentioned it in their paper but then ended up not using it.
- Incorrect calculation of significance threshold: MM05 (plus subsequent comments by McIntyre) claims that not only is the RE statistic nonstandard, there were problems with the way MBH98 used it. First off, there is no theoretical distribution of the RE statistic, so calculating the cutoff needed to attain a particular significance level is a tricky exercise (this is one of many reasons why using a RE statistic may be ill-advised, according to McIntyre). MBH98 calculated the cutoff value for 99% significance incorrectly to be 0. The correct value according to McIntyre was about 0.54, whereas the actual RE statistic value for the data set in MBH98 was 0.48, i.e., not close enough. A later paper by Ammann and Wahl, cited by many as a vindication of MBH98, computed a similar cutoff of 0.52, so that the actual RE statistic value failed the significance test. So how did it manage to vindicate MBH98 when the value of the RE statistic failed the cutoff? They appear to have employed a novel statistical procedure, coming up with something called a calibration/verification RE ratio. McIntyre was quite critical of this, for reasons he described in detail here.
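To give a feel for the objects being argued about, here is a minimal sketch of two of them: centering a series on a sub-period versus the full period, and the RE statistic next to the squared correlation. It assumes the usual definition of RE, namely one minus the ratio of the squared prediction errors over the verification period to the squared errors of a reference prediction that always guesses the calibration-period mean. The function names and numbers are hypothetical, and nothing here reproduces the MBH98 or MM05 calculations.

```python
def center(series, reference):
    """Subtract the mean of `reference` from every value in `series`."""
    mean = sum(reference) / len(reference)
    return [x - mean for x in series]


def reduction_of_error(observed, predicted, calibration_mean):
    """RE = 1 - SSE(prediction) / SSE(always predicting the calibration mean)."""
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sse_ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / sse_ref


def squared_correlation(observed, predicted):
    """Square of the Pearson correlation between observed and predicted."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Toy series: flat for a long time, then rising at the end.
series = [0.0, 0.0, 0.0, 0.0, 2.0, 4.0]
print(center(series, series))        # full centering
print(center(series, series[-2:]))   # short-centering on the last two points

# Toy verification-period values (made up).
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
print(reduction_of_error(observed, predicted, calibration_mean=0.0))
print(squared_correlation(observed, predicted))
```

Under short-centering, the long flat stretch of the toy series ends up far from zero, which is why series with that shape carry more weight in a subsequent principal components analysis. Also note that RE is measured against the calibration-period mean, so it responds to shifts in the mean in a way that squared correlation does not. The two statistics can therefore disagree about the same reconstruction.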
There has been a lengthy debate on the subject, plus two external inquiries and reports on the debate: the NAS Panel Report headed by Gerry North, and the Wegman Report headed by Edward Wegman. Both of them agreed with the statistical criticisms made by McIntyre, but the NAS report did not make any broader comments on what this says about the discipline or the general hockey stick hypothesis, while the Wegman report made more explicit criticism.
The Wegman Report made the insularity critique in some detail:
In general, we found MBH98 and MBH99 to be somewhat obscure and incomplete and the criticisms of MM03/05a/05b to be valid and compelling. We also comment that they were attempting to draw attention to the discrepancies in MBH98 and MBH99, and not to do paleoclimatic temperature reconstruction. Normally, one would try to select a calibration dataset that is representative of the entire dataset. The 1902-1995 data is not fully appropriate for calibration and leads to a misuse in principal component analysis. However, the reasons for setting 1902-1995 as the calibration point presented in the narrative of MBH98 sounds reasonable, and the error may be easily overlooked by someone not trained in statistical methodology. We note that there is no evidence that Dr. Mann or any of the other authors in paleoclimatology studies have had significant interactions with mainstream statisticians.
In our further exploration of the social network of authorships in temperature reconstruction, we found that at least 43 authors have direct ties to Dr. Mann by virtue of coauthored papers with him. Our findings from this analysis suggest that authors in the area of paleoclimate studies are closely connected and thus ‘independent studies’ may not be as independent as they might appear on the surface. This committee does not believe that web logs are an appropriate forum for the scientific debate on this issue.
It is important to note the isolation of the paleoclimate community; even though they rely heavily on statistical methods they do not seem to be interacting with the statistical community. Additionally, we judge that the sharing of research materials, data and results was haphazardly and grudgingly done. In this case we judge that there was too much reliance on peer review, which was not necessarily independent. Moreover, the work has been sufficiently politicized that this community can hardly reassess their public positions without losing credibility. Overall, our committee believes that Mann’s assessments that the decade of the 1990s was the hottest decade of the millennium and that 1998 was the hottest year of the millennium cannot be supported by his analysis.
McIntyre has a lengthy blog post summarizing what he sees as the main parts of the NAS Panel Report, the Wegman Report, and other statements made by statisticians critical of MBH98.
Critique #3: Inadequate use of software engineering, project management, and coding documentation and testing principles
In the aftermath of Climategate, most public attention was drawn to the content of the emails. But apart from the emails, data and code were also leaked, and this gave the world an inside view of the code that's used to simulate the climate. A number of criticisms of the coding practice emerged.
Chicago Boyz had a lengthy post titled Scientists are not Software Engineers that noted the sloppiness in the code and some of its implications. The post was also quick to point out that poor-quality code is not unique to climate science: it is a general problem when small-scale academic research code grows into a large-scale project beyond what the coders originally intended, with no systematic effort made to refactor the code. (If you have thoughts on the general prevalence of good software engineering practices in code for academic research, feel free to share them by answering my Quora question here, and if you have insights on climate science code in particular, answer my Quora question here.) Below are some excerpts from the post:
No, the real shocking revelation lies in the computer code and data that were dumped along with the emails. Arguably, these are the most important computer programs in the world. These programs generate the data that is used to create the climate models which purport to show an inevitable catastrophic warming caused by human activity. It is on the basis of these programs that we are supposed to massively reengineer the entire planetary economy and technology base.
The dumped files revealed that those critical programs are complete and utter train wrecks.
The design, production and maintenance of large pieces of software require project management skills greater than those required for large material construction projects. Computer programs are the most complicated pieces of technology ever created. By several orders of magnitude they have more “parts” and more interactions between those parts than any other technology.
Software engineers and software project managers have created procedures for managing that complexity. It begins with seemingly trivial things like style guides that regulate what names programmers can give to attributes of software and the associated datafiles. Then you have version control in which every change to the software is recorded in a database. Programmers have to document absolutely everything they do. Before they write code, there is extensive planning by many people. After the code is written comes the dreaded code review in which other programmers and managers go over the code line by line and look for faults. After the code reaches its semi-complete form, it is handed over to Quality Assurance which is staffed by drooling, befanged, malicious sociopaths who live for nothing more than to take a programmer’s greatest, most elegant code and rip it apart and possibly sexually violate it. (Yes, I’m still bitter.)
Institutions pay for all this oversight and double-checking and programmers tolerate it because it is impossible to create a large, reliable and accurate piece of software without such procedures firmly in place. Software is just too complex to wing it.
Clearly, nothing like these established procedures was used at CRU. Indeed, the code seems to have been written overwhelmingly by just two people (one at a time) over the past 30 years. Neither of these individuals was a formally trained programmer and there appears to have been no project planning or even formal documentation. Indeed, the comments of the second programmer, the hapless “Harry”, as he struggled to understand the work of his predecessor are now being read as a kind of programmer’s Icelandic saga describing a death march through an inexplicable maze of ineptitude and boobytraps.
A lot of the CRU code is clearly composed of hacks. Hacks are informal, off-the-cuff solutions that programmers think up on the spur of the moment to fix some little problem. Sometimes they are so elegant as to be awe inspiring and they enter programming lore. More often, however, they are crude, sloppy and dangerously unreliable. Programmers usually use hacks as a temporary quick solution to a bottleneck problem. The intention is always to come back later and replace the hack with a more well-thought-out and reliable solution, but with no formal project management and time constraints it’s easy to forget to do so. After a time, more code evolves that depends on the existence of the hack, so replacing it becomes a much bigger task than just replacing the initial hack would have been.
(One hack in the CRU software will no doubt become famous. The programmer needed to calculate the distance and overlapping effect between weather monitoring stations. The non-hack way to do so would be to break out the trigonometry and write a planned piece of code to calculate the spatial relationships. Instead, the CRU programmer noticed that that the visualization software that displayed the program’s results already plotted the station’s locations so he sampled individual pixels on the screen and used the color of the pixels between the stations to determine their location and overlap! This is a fragile hack because if the visualization changes the colors it uses, the components that depend on the hack will fail silently.)
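As an illustration of what the "non-hack way" mentioned in the excerpt might look like, here is a minimal sketch that computes the distance between two stations directly from their coordinates, with a small test attached. It uses the standard haversine formula on a spherical Earth. The function name, the radius constant, and the test are my own assumptions, and this is not a reconstruction of what the CRU code did or should have done.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def test_great_circle_km():
    """Checks against cases whose answers are known in closed form."""
    quarter = math.pi / 2 * EARTH_RADIUS_KM
    assert great_circle_km(10.0, 20.0, 10.0, 20.0) == 0.0
    assert abs(great_circle_km(0.0, 0.0, 0.0, 90.0) - quarter) < 1e-6
    assert abs(great_circle_km(0.0, 0.0, 90.0, 0.0) - quarter) < 1e-6


test_great_circle_km()
```

Unlike the pixel-sampling approach, this depends only on the station coordinates, so a change to the plotting colors cannot silently break it. The test also documents what the function is supposed to return.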
For some choice comments excerpted from a code file, see here.
Critique #4: Practices of publication of data, metadata, and code (that had gained traction in other disciplines)
When McIntyre wanted to replicate MBH98, he emailed Mann asking for his data and code. Mann, though initially cooperative, soon started trying to fend McIntyre off. Part of this was because he thought McIntyre was out to find something wrong with his work (a well-grounded suspicion). But part of it was also that his data and code were a mess. He didn't maintain them in a way that he'd be comfortable sharing them with anybody other than an already sympathetic academic. And, more importantly, as Mann's colleague Stephen Schneider noted, nobody asked for the code and underlying data during peer review. And most journals at the time did not require authors to submit or archive their code and data at the time of submission or acceptance of their paper. This also closely relates to Critique #3: a requirement or expectation that one's data and code would be published along with one's paper might make people more careful to follow good coding practices and avoid using various "tricks" and "hacks" in their code.
Here's how Andrew Montford puts it in The Hockey Stick Illusion:
The Hockey Stick affair is not the first scandal in which important scientific papers underpinning government policy positions have been found to be non-replicable – McCullough and McKitrick review a litany of sorry cases from several different fields – but it does underline the need for a more solid basis on which political decision-making should be based. That basis is replication. Centuries of scientific endeavour have shown that truth emerges only from repeated experimentation and falsification of theories, a process that only begins after publication and can continue for months or years or decades thereafter. Only through actually reproducing the findings of a scientific paper can other researchers be certain that those findings are correct. In the early history of European science, publication of scientific findings in a journal was usually adequate to allow other researchers to replicate them. However, as science has advanced, the techniques used have become steadily more complicated and consequently more difficult to explain. The advent of computers has allowed scientists to add further layers of complexity to their work and to handle much larger datasets, to the extent that a journal article can now, in most cases, no longer be considered a definitive record of a scientific result. There is simply insufficient space in the pages of a print journal to explain what exactly has been done. This has produced a rather profound change in the purpose of a scientific paper. As geophysicist Jon Claerbout puts it, in a world where powerful computers and vast datasets dominate scientific research, the paper ‘is not the scholarship itself, it is merely advertising of the scholarship’. The actual scholarship is the data and code used to generate the figures presented in the paper and which underpin its claims to uniqueness.
In passing we should note the implications of Claerbout’s observations for the assessment for our conclusions in the last section: by using only peer review to assess the climate science literature, the policymaking community is implicitly expecting that a read-through of a partial account of the research performed will be sufficient to identify any errors or other problems with the paper. This is simply not credible. With a full explanation of methodology now often not possible from the text of a paper, replication can usually only be performed if the data and code are available. This is a major change from a hundred years ago, but in the twenty-first century it should be a trivial problem to address. In some specialisms it is just that. We have seen, however, how almost every attempt to obtain data from climatologists is met by a wall of evasion and obfuscation, with journals and funding bodies either unable or unwilling to assist. This is, of course, unethical and unacceptable, particularly for publicly funded scientists. The public has paid for nearly all of this data to be collated and has a right to see it distributed and reused. As the treatment of the Loehle paper shows, for scientists to open themselves up to criticism by allowing open review and full data access is a profoundly uncomfortable process, but the public is not paying scientists to have comfortable lives; they are paying for rapid advances in science. If data is available, doubts over exactly where the researcher has started from fall away. If computer code is made public too, then the task of replication becomes simpler still and all doubts about the methodology are removed. The debate moves on from foolish and long-winded arguments about what was done (we still have no idea exactly how Mann calculated his confidence intervals) onto the real scientific meat of whether what was done was correct.
As we look back over McIntyre’s work on the Hockey Stick, we see that much of his time was wasted on trying to uncover from the obscure wording of Mann’s papers exactly what procedures had been used. Again, we can only state that this is entirely unacceptable for publicly funded science and is unforgiveable in an area of such enormous policy importance. As well as helping scientists to find errors more quickly, replication has other benefits that are not insignificant. David Goodstein of the California Institute of Technology has commented that the possibility that someone will try to replicate a piece of work is a powerful disincentive to cheating – in other words, it can help to prevent scientific fraud. Goodstein also notes that, in reality, very few scientific papers are ever subject to an attempt to replicate them. It is clear from Stephen Schneider’s surprise when asked to obtain the data behind one of Mann’s papers that this criticism extends into the field of climatology. In a world where pressure from funding agencies and the demands of university careers mean that academics have to publish or perish, precious few resources are free to replicate the work of others. In years gone by, some of the time of PhD students might have been devoted to replicating the work of rival labs, but few students would accept such a menial task in the modern world: they have their own publication records to worry about. It is unforgiveable, therefore, that in paleoclimate circles, the few attempts that have been made at replication have been blocked by all of the parties in a position to do something about it. Medical science is far ahead of the physical sciences in the area of replication.
Doug Altman, of Cancer Research UK’s Medical Statistics group, has commented that archiving of data should be mandatory and that a failure to retain data should be treated as research misconduct. The introduction of this kind of regime to climatology could have nothing but a salutary effect on its rather tarnished reputation. Other subject areas, however, have found simpler and less confrontational ways to deal with the problem. In areas such as econometrics, which have long suffered from politicisation and fraud, several journals have adopted clear and rigorous policies on archiving of data. At publications such as the American Economic Review, Econometrica and the Journal of Money, Credit and Banking, a manuscript that is submitted for publication will simply not be accepted unless data and fully functional code are available. In other words, if the data and code are not public then the journals will not even consider the article for publication, except in very rare circumstances. This is simple, fair and transparent and works without any dissent. It also avoids any rancorous disagreements between journal and author after the event. Physical science journals are, by and large, far behind the econometricians on this score. While most have adopted one pious policy or another, giving the appearance of transparency on data and code, as we have seen in the unfolding of this story, there has been a near-complete failure to enforce these rules. This failure simply stores up potential problems for the editors: if an author refuses to release his data, the journal is left with an enforcement problem from which it is very difficult to extricate itself. Its sole potential sanction is to withdraw the paper, but this then merely opens it up to the possibility of expensive lawsuits. It is hardly surprising that in practice such drastic steps are never taken.
The failure of climatology journals to enact strict policies or enforce weaker ones represents a serious failure in the system of assurance that taxpayer-funded science is rigorous and reliable. Funding bodies claim that they rely on journals to ensure data availability. Journals want a quiet life and will not face down the academics who are their lifeblood. Will Nature now go back to Mann and threaten to withdraw his paper if he doesn’t produce the code for his confidence interval calculations? It is unlikely in the extreme. Until politicians and journals enforce the sharing of data, the public can gain little assurance that there is any real need for the financial sacrifices they are being asked to accept. Taking steps to assist the process of replication will do much to improve the conduct of climatology and to ensure that its findings are solidly based, but in the case of papers of pivotal importance politicians must also go further. Where a paper like the Hockey Stick appears to be central to a set of policy demands or to the shaping of public opinion, it is not credible for policymakers to stand back and wait for the scientific community to test the veracity of the findings over the years following publication. Replication and falsification are of little use if they happen after policy decisions have been made. The next lesson of the Hockey Stick affair is that if governments are truly to have assurance that climate science is a sound basis for decision-making, they will have to set up a formal process for replicating key papers, one in which the oversight role is performed by scientists who are genuinely independent and who have no financial interest in the outcome.
Montford, Andrew (2011-06-06). The Hockey Stick Illusion (pp. 379-383). Stacey Arts. Kindle Edition.