VincentYu

Thanks for writing such a comprehensive explanation!

Why is downvoting disabled, for how long has it been like this, and when will it be back?

In support of your point, MIRI itself changed (in the opposite direction) from its former stance on AI research.

You've been around long enough to know this, but for others: The former ambition of MIRI in the early 2000s—back when it was called the SIAI—was to create artificial superintelligence, but that ambition changed to ensuring AI friendliness after considering the "terrible consequences [now] feared by the likes of MIRI".

In the words of Zack_M_Davis 6 years ago:

(Disclaimer: I don't speak for SingInst, nor am I presently affiliated with them.)

But recall that the old name was "Singularity Institute for Artificial Intelligence," chosen before the inherent dangers of AI were understood. The unambiguous "for" is no longer appropriate, and "Singularity Institute about Artificial Intelligence" might seem awkward.

I seem to remember someone saying back in 2008 that the organization should rebrand as the "Singularity Institute For or Against Artificial Intelligence Depending on Which Seems to Be a Better Idea Upon Due Consideration," but obviously that was only a joke.

I've always thought it's a shame they picked the name MIRI over SIFAAIDWSBBIUDC.

  • who on lesswrong tracks their predictions outside of predictionbook, and their thoughts on that method

Just adding to the other responses: I also use Metaculus and like it a lot. In another thread, I posted a rough note about its community's calibration.

Compared to PredictionBook, the major limitation of Metaculus is that users cannot create and predict on arbitrary questions, because questions are curated. This is an inherent limitation/feature for a website like Metaculus because they want the community to focus on a set of questions of general interest. In Metaculus's case, 'general interest' translates mostly to 'science and technology'; for questions on politics, I suggest taking a look at GJ Open instead.

Here is the full-text article that was actually published by Kahneman et al. (2011) in Harvard Business Review, and here is the figure that was in HBR.

Is there any information on how well-calibrated the community predictions are on Metaculus?

Great question! Yes. There was a post on the official Metaculus blog that addressed this, though that was back in Oct 2016. They've also sent subscribed users a few emails that looked at community calibration.

I actually did my own analysis on this around two months ago, in private communication. Let me just copy two of the plots I created and what I said there. You might want to ignore the plots and details, and just skip to the "brief summary" at the end.

(Questions on Metaculus go through an 'open' phase then a 'closed' phase; predictions can only be made and updated while the question is open. After a question closes, it gets resolved either positive or negative once the outcome is known. I based my analysis on the 71 questions that had been resolved as of two months ago; there are around 100 resolved questions now.)

First, here's a plot for the 71 final median predictions. The elements of this plot:

  • Of all monotonic functions, the black line is the one that, when applied to this set of median predictions, achieves the best mean score under every proper scoring rule, given the realized outcomes. (It is also the maximum-likelihood monotonic function.) It can be read as a histogram with adaptive bin widths: for instance, the figure shows that, binned together, predictions from 14% to 45% resolved positive around 0.11 of the time. (A rough code sketch of this construction follows this list.)

  • The confidence bands are for the null hypothesis that the 71 predictions are all perfectly calibrated and independent, so that we can sample counterfactual outcomes simply by treating each prediction with credence p as an independent coin flip with probability p of positive resolution. I sampled 80,000 sets of these 71 outcomes and built the confidence bands by computing the corresponding maximum-likelihood monotonic function for each set (this simulation is also outlined in the sketch below). The inner band is pointwise 1 sigma, whereas the outer is familywise 2 sigma. So the corner of the black line that exceeds the outer band around predictions of 45% is a p < 0.05 event under perfect calibration, and it looks to me like predictions around 30% to 40% are miscalibrated (underconfident).

  • The two rows of tick marks below the x-axis show the 71 predictions, with the upper green row comprising positive resolutions, and the lower red row comprising negatives.

  • The dotted blue line is a rough estimate of the proportion of questions resolving positive along the range of predictions, based on kernel density estimates of the distributions of predictions giving positive and negative resolutions.
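
For the curious, here is a minimal sketch of how a plot like this could be reproduced. It is not my analysis code: it assumes that the best-scoring monotonic function coincides with an isotonic-regression (pool-adjacent-violators) fit of outcomes on predictions, it uses scikit-learn's IsotonicRegression, it substitutes placeholder data for the 71 median predictions and outcomes, it runs far fewer simulations than the 80,000 I used, and it computes only a pointwise 1-sigma band (the familywise band takes more bookkeeping).

```python
# Sketch only: placeholder data, pointwise band, and an isotonic-regression
# fit standing in for the "best monotonic function" described above.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Placeholder data: replace with the 71 final median predictions (in [0, 1])
# and their resolutions (1 = positive, 0 = negative).
preds = rng.uniform(0.05, 0.95, size=71)
outcomes = (rng.uniform(size=71) < preds).astype(float)

def monotonic_fit(p, y):
    """Isotonic regression of outcomes on predictions."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(p, y)
    return iso

grid = np.linspace(0.01, 0.99, 99)
observed_curve = monotonic_fit(preds, outcomes).predict(grid)

# Null hypothesis of perfect calibration and independence: each prediction p
# resolves positive with probability p. Refit the monotonic curve on each
# simulated set of outcomes.
n_sims = 2000  # I used 80,000; fewer is fine for a sketch
null_curves = np.empty((n_sims, grid.size))
for i in range(n_sims):
    simulated = (rng.uniform(size=preds.size) < preds).astype(float)
    null_curves[i] = monotonic_fit(preds, simulated).predict(grid)

# Pointwise 1-sigma band under the null.
lo, hi = np.percentile(null_curves, [15.9, 84.1], axis=0)
outside = (observed_curve < lo) | (observed_curve > hi)
print("Grid points where the observed curve leaves the 1-sigma band:")
print(grid[outside])
```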

Now, a plot of all 3723 final predictions on the 71 questions.

  • The black line is again the monotonic function that minimizes mean proper score, but with the 1% and 99% predictions removed because—as I expected—they were especially miscalibrated (overconfident) compared to nearby predictions.

  • The two black dots indicate the proportion resolving positive for the 1% and 99% predictions (around 0.4 and 0.8, respectively).

  • I don't have any bands indicating dispersion here because these predictions are a correlated mess that I can't deal with. But for predictions below 20%, the deviation from the diagonal looks large enough that I think it shows miscalibration (overconfidence).

  • Along the x-axis I've plotted kernel density estimates of the predictions resolving positive (green, solid line) and negative (red, dotted line). Kernel densities were computed in log-odds space with Gaussian kernels, then converted back to probabilities in [0, 1]. (A sketch of this follows the list.)

  • The blue dotted line is again a rough estimate of the proportion resolving positive, using these two density estimates.
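
Again for the curious, a minimal sketch of the log-odds kernel density estimates and the resulting estimate of the proportion resolving positive. This is not my analysis code: the placeholder data, the group sizes, and the helper names (logit, logodds_kde) are made up for illustration, scipy's default bandwidth is used rather than whatever I used at the time, and weighting the two densities by group size is one reasonable choice among several.

```python
# Sketch only: Gaussian KDEs fit in log-odds space, mapped back to densities
# over probabilities, then combined into an estimate of the proportion of
# questions resolving positive at each prediction level.
import numpy as np
from scipy.stats import gaussian_kde

def logit(p):
    return np.log(p / (1.0 - p))

def logodds_kde(preds_in_group, eval_probs):
    """Gaussian KDE fit on logit(predictions), evaluated at probabilities.

    Returns a density over probability space: the log-odds density times
    the Jacobian d(logit(p))/dp = 1 / (p * (1 - p)).
    """
    kde = gaussian_kde(logit(preds_in_group))
    return kde(logit(eval_probs)) / (eval_probs * (1.0 - eval_probs))

# Placeholder data: replace with the final predictions, split by resolution.
rng = np.random.default_rng(1)
pos = rng.beta(4, 2, size=300)  # predictions on questions that resolved positive
neg = rng.beta(2, 4, size=200)  # predictions on questions that resolved negative

grid = np.linspace(0.01, 0.99, 99)
dens_pos = logodds_kde(pos, grid)
dens_neg = logodds_kde(neg, grid)

# Rough estimate of the proportion resolving positive at each prediction level
# (the dotted blue line), weighting each density by its group size.
prop_positive = (len(pos) * dens_pos) / (len(pos) * dens_pos + len(neg) * dens_neg)
print(np.round(prop_positive[::10], 2))
```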

Brief summary:

  • Median predictions around 30% to 40% resolve positive less often than claimed (underconfidence).
  • User predictions below around 20% resolve positive more often than claimed (overconfidence).
  • User predictions at 1% and 99% are obviously overconfident.
  • Elsewhere, calibration seems okay; at least, it isn't obviously off.
  • I'm very surprised that user predictions look fairly accurate around 90% and 95% (resolving positive around 0.85 and 0.90 of the time). I expected strong overconfidence like that shown by the predictions below 20%.

Also, if one wanted to get into it, could you describe what your process is?

Is there anything in particular that you want to hear about? Or would you rather have a general description of 1) how I'd suggest starting out on Metaculus, and/or 2) how I approach making and updating predictions on the site, and/or 3) something else?

(The FAQ is handy for questions about the site. It's linked from the 'help' button at the bottom of every page.)

That's some neat data and a nice observation! Could there be other substantial moderating differences between the days when you generate ~900 kJ and the days when you don't? (E.g., does your mental state before you ride affect how much energy you generate? That would suggest a different causal relationship.) If there are, maybe some of these effects could be removed by independently randomizing the energy you generate each time you ride, so that you don't get to choose how much you ride.

To make this a single-blinded experiment, just wear a blindfold; to double blind, add a high-beam lamp to your bike; and to triple blind, equip and direct high beams both front and rear.

… okay, there will be no blinding.

Polled.

  1. I generally do only a quick skim of post titles and open threads (edit: maybe twice a month on average; I'll try visiting more often). I used to check LW compulsively prior to 2013, but now I think both LW and I have changed a lot and diverged from each other. No hard feelings, though.

  2. I rarely click link posts on LW. I seldom find them interesting, but I don't mind them as long as other LWers like them.

  3. I mostly check LW through a desktop browser. Back in 2011–2012, I used Wei Dai's "Power Reader" script to read all comments. I also used to rely on Dbaupp's "scroll to new comments" script after they posted it in 2011, but these days I use Bakkot's "comment highlight" script. (Thanks to all three of you!)

  4. I've been on Metaculus a lot over the past year. It's a prediction website focusing on science and tech (the site's been mentioned a few times on LW, and in fact that's how I heard of it). It's sort of like a gamified and moderated PredictionBook. (Edit: It's also similar to GJ Open, but IMO, Metaculus has way better questions and scoring.) It's a more-work-less-talk kind of website, so it's definitely not a site for general discussions.

    I've been meaning to write an introductory post about Metaculus… I'll get to that sometime.

    Given that one of LW's past focuses was biases, heuristics, and the Bayesian interpretation of probability, I think some of you might find it worthwhile and fun to get some real-world practice at updating subjective probabilities in response to evidence. Metaculus is all about that sort of stuff, so join us! (My username there is 'v'. I recognize a few of you, especially WhySpace, over there.) The site itself is under continual development, and I know that the admins have high ambitions for it.

Edit: By the way, this is a great post and idea. Thanks!

I haven't been around for a while, but I expect to start fulfilling the backlog of requests after Christmas. Sorry for the long wait.

Do we know which country Wright was living in during 2010?
