I don't know much about Savage apart from the fact that both Bayesians and frequentists dislike him. And I did not follow Jaynes's math fully, and there are some papers going back and forth on some of his assumptions, so the mathematical underpinnings may not be as strong as we would like.
I don't know. Intuitively you should be able to ground the agent stuff in information theory, because the rules they put forward are the same. Jaynes also has a chapter on decision theory where he makes the wonderful point that the utility function is way more arbitrary than a prior, so you might as well be Bayesian if you are into inventing ad hoc functions anyway.
Ahh, I know that is a first-year course for most math students, but only math students take that class :). I have never read an analysis book :). I took the applied path and read 3 other Bayesian books before this one, so I thought the math in this book was simultaneously very tedious and basic :)
If anyone relies on tags to find posts, and you feel this post is missing a tag, then "Tag suggestions" will be much appreciated
That's surprising to me. I think you can read the book two ways: 1) you skim the math, enjoy the philosophy, and take his word that the math says what he says it says, or 2) you try to understand the math. If you take 2) then you need to at least know the chain rule of integration and what a Dirac delta function is, which seem like high-level math concepts to me. Full disclosure: I am a biochemist by training, so I have also read it without the prerequisite formal training. I think you are right that if you ignore chapter 2 and a few sections about partition functions and such, then the math level for the other 80% is undergraduate-level math.
Crap, you are right, this was one of the last things we changed before publishing because our previous examples were too combative :(.
I will fix it later today.
I think this is a pedagogical version of Andrew Gelman's shrinkage trilogy.
The most important paper also has a blog post. The very short version is: if you z-score the published effects, then you can derive a prior for the 20,000+ effects from the Cochrane database. A Cauchy distribution fits very well. The Cauchy distribution has very fat tails, so you should regress small effects heavily towards the null and regress very large effects very little.
Here is a fun figure of the effects, Medline is published stuff, so no effects between -2 and 2 as they wo...
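To make that shrinkage pattern concrete, here is a minimal numerical sketch (not the paper's actual model; the Cauchy scale of 1 and the unit-variance likelihood for the z-score are assumptions for illustration): it computes the posterior mean of a true effect given an observed z-score and shows that small z-scores get pulled strongly towards 0 while large ones barely move.

import numpy as np
from scipy import stats

def posterior_mean(z, prior_scale=1.0, grid=np.linspace(-20, 20, 4001)):
    # Cauchy prior on the true effect theta (scale is an assumption),
    # N(theta, 1) likelihood for the observed z-score.
    prior = stats.cauchy.pdf(grid, loc=0, scale=prior_scale)
    likelihood = stats.norm.pdf(z, loc=grid, scale=1)
    weights = prior * likelihood
    return np.sum(grid * weights) / np.sum(weights)

for z in [1, 2, 5]:
    print(z, round(posterior_mean(z), 2))
# small z-scores get regressed heavily towards the null, large ones hardly at all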
SR if you can only read one; if you do not expect to do fancy things then ROS may be better, as it is very good and explains the basics better. The Logic of Science should be your 5th book and is a good goal to set. The Logic of Science is probably the rationalist bible: much like the real bible, everybody swears by it but nobody has read or understood it :)
Thanks for the reply. 3 seems very automatable: record all the text before the image, and if that's 4 minutes, then put the image in after 4 minutes. But I totally get that stuff is more complicated than it initially seems, keep up the good work!
I agree tails are important, but for calibration few of your predictions should land in the tail, so imo you should focus on getting the trunk of the distribution right first, and then later learn to do overdispersed predictions. There is no closed-form solution to calibration for a t distribution, but there is for a normal, so for pedagogical reasons I am biting the bullet and assuming the normal is correct :). Part 10 in this series, 3 years in the future, may be some black magic of the posterior of your t predictions using HMC to approximate the 2d posterior of sigma and nu ;), and then you can complain "but what about skewed distributions" :P
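For anyone who wants the closed-form normal rule as code rather than prose, here is a minimal sketch (the numbers are made up): take the z-score of each outcome under its predicted N(mu, sigma); the RMSE of those z-scores is the factor by which your future sigmas should be scaled.

import numpy as np

# each prediction: (predicted mu, predicted sigma, observed outcome) - made-up numbers
predictions = [(10.0, 2.0, 13.1), (0.0, 1.0, -0.4), (5.0, 0.5, 5.9)]

z = np.array([(obs - mu) / sigma for mu, sigma, obs in predictions])
calibration_factor = np.sqrt(np.mean(z ** 2))  # RMSE of z-scores; 1 means well calibrated
print(calibration_factor)  # multiply your future sigmas by this factor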
The text to speech is phenomenal! Only math and tables suck.
Suggestions for future iterations:
It would be nice if you wrote a short paragraph for each link, like "requires download" or "questions are from 2011", or sorted the list somehow :)
Yes. You can change future mu by being smarter and future sigma by being better calibrated; my rule assumes you don't get smarter and therefore have to adjust only future sigma.
If you actually get better at prediction you could argue you would need to update less than the RMSE estimate suggests :)
I agree with both points
If you are new to continuous predictions then you should focus on the 50% interval, as it gives you the most information about your calibration. If you are skilled and use, for example, a t-distribution, then you have sigma for the trunk and nu for the tail; even then few predictions should land in the tails, so most data should provide more information about how to adjust sigma than how to adjust nu.
Hot take: I think the focus on 95% is an artifact of us focusing on p<0.05 in frequentist statistics.
Our ability to talk past each other is impressive :)
would have been an easier way to illustrate your point). I think this is actually the assumption you're making. [Which is a horrible assumption, because if it were true, you would already be perfectly calibrated].
Yes, this is almost the assumption I am making. The general point of this post is to assume that all your predictions follow a normal distribution, with mu as "guessed" and with a sigma that is different from what you guessed, and then use the RMSE of your z-scores to get a point estimate for the counterfactual sigma you sho...
Thanks! I am planning on writing a few more in this vein; currently I have some rough drafts of:
I can't promise they will be as good as this ...
Yes, you are right, but under the assumption that the errors are normally distributed, then I am right:
If: 22% of your errors are drawn from N(0, 0.5) and 78% from N(0, 1.1), so that the mean squared error is approximately 1,
Then: the median squared error is much less than 1.
proof:
import numpy as np
import pandas as pd
import scipy.stats as stats

# 22% of errors with sd 0.5 and 78% with sd 1.1 -> mean squared error ~ 1
x1 = stats.norm(0, 0.5).rvs(22 * 10000)
x2 = stats.norm(0, 1.1).rvs(78 * 10000)
x12 = pd.Series(np.concatenate([x1, x2]))
print((x12 ** 2).median())  # prints a value well below 1
I am making the simple observation that the median squared error is less than one because the mean squared error is one.
That's also how I conceptualize it: you have to change your intervals because you are too stupid to make better predictions. If the prediction was always spot on then sigma should be 0, and then my scheme does not make sense.
If you suck like me and get a prediction very close, then I would probably say: that sometimes happens :). Note I assume the average squared error should be 1, which means most errors are less than 1, because for example (0^2 + 2^2)/2 = 2 > 1.
I agree, most things are not normally distributed, and my calibration rule answers how to rescale to a normal. Metaculus uses the cdf of the predicted distribution, which is better if you have lots of predictions; my scheme gives an actionable number faster by making assumptions that are wrong. But if you, like me, have intervals that seem off by almost a factor of 2, then your problem is not the tails but the entire region :), so the trade-off seems worth it.
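For reference, here is a minimal sketch of the cdf-based check being contrasted here (my own illustration with made-up numbers, not Metaculus code): evaluate each predicted distribution's cdf at the observed outcome and check whether those values look U(0, 1).

import numpy as np
from scipy import stats

# (predicted distribution, observed outcome) - made-up numbers
preds = [(stats.norm(10, 2), 12.9), (stats.t(df=4, loc=0, scale=1), -0.3), (stats.norm(5, 1), 6.8)]

u = np.array([dist.cdf(obs) for dist, obs in preds])
print(u)                           # should look U(0, 1) if you are calibrated
print(stats.kstest(u, "uniform"))  # a formal uniformity check, only informative with many predictions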
Agreed. More importantly, the two distributions have different kurtosis, so their tails are very different a few sigmas away.
I do think the Laplace distribution is a better beginner distribution because of its fat tails, but advocating for people to use a distribution they have never heard of seems like too tough a sell :)
My original opening statement got trashed for being too self-congratulatory, so the current one is a hot fix :). So I agree with you!
Me too. I learned about this from another disease and thought that's probably how it works for colorblindness as well.
I would love to have you as a reviewer of my second post, as there I will try to justify why I think this approach is better. You can even super dislike it before I publish if you still feel like that when I present my strongest arguments, or maybe convince me that I am wrong, so I don't publish part 2 and make a partial retraction for this post :). There is a decent chance you are right, as you are the stronger predictor of the two of us :)
Can I use this image for my "part 2" post, to explain how "pros" calibrate their continuous predictions, and how it stacks up against my approach? I will add you as a reviewer before publishing so you can make corrections in case I accidentally straw-man or misunderstand you :)
I will probably also make a part 3 titled "Try t predictions" :), that should address some of your other critiques about the normal being bad :)
Note 1 for JenniferRM: I have updated the text, so it should alleviate your confusion. If you have time, try to re-read the post before reading the rest of my comment; hopefully the few changes are enough to answer why we want RMSE=1 and not 0.
Note 2 for JenniferRM and others who share her confusion: if the updated post is not sufficient but the text below is, how do I make my point clear without making the post much longer?
With binary predictions you can cheat and predict 50/50 as you point out... You can't cheat with continuous predictions as ther...
The big ask is making normal predictions; calibrating them can be done automatically using Google Sheets: here is an example
I totally agree with both your points. This comment from a Metaculus user has some good objections to "us" :)
I am sorry if I have straw-manned you, and I think your above post is generally correct. I think we are coming from two different worlds.
You are coming from Metaculus, where people make a lot of predictions, where having 50+ predictions is the norm, and thus looking at a U(0, 1) gives a lot of intuitive evidence of calibration.
I come from a world where people want to improve in all kinds of ways, and one of them is prediction. Few people write more than 20 predictions down a year, and when they do they more or less ALWAYS make dichotomous predictions. I ...
TLDR for our disagreement:
SimonM: Transforming to Uniform distribution works for any continuous variable and is what Metaculus uses for calibration
Me: the variance trick from this post for calculating sigma is better if your variables are from a normal distribution, or something close to a normal.
SimonM: Even for a Normal the Uniform is better.
I don't know what s.f is, but the interval around 1.73 is obviously huge; with 5-10 data points it's quite narrow if your predictions are drawn from N(1, 1.73). That is what my next post will be about. There might also be a smart way to do this using the Uniform, but I would be surprised if its dispersion is smaller than a chi^2 distribution :) (changing the mean is cheating, we are talking about calibration, so you can only change your dispersion)
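As a rough preview of that next post (a sketch under the assumption that the predictions really are normal; the exact recipe in the post may differ): if the true dispersion is s times the predicted sigma, then the sum of squared z-scores divided by s^2 follows a chi^2 distribution with n degrees of freedom, which gives an interval for the rescaling factor s.

import numpy as np
from scipy import stats

def sigma_interval(z_scores, level=0.95):
    # If outcomes ~ N(mu, (s * sigma_pred)^2), then sum(z^2) / s^2 ~ chi2 with n dof,
    # which yields a confidence interval for the rescaling factor s.
    z = np.asarray(z_scores)
    n = len(z)
    sum_sq = np.sum(z ** 2)
    lo = np.sqrt(sum_sq / stats.chi2.ppf(1 - (1 - level) / 2, df=n))
    hi = np.sqrt(sum_sq / stats.chi2.ppf((1 - level) / 2, df=n))
    return lo, hi

print(sigma_interval([2.33, -0.67]))  # 2 points with RMSE ~1.7 -> a huge interval
print(sigma_interval([1.73] * 10))    # 10 points with the same RMSE -> already much tighter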
Hard disagree. From two data points I calculate that my future intervals should be 1.73 times wider; converting these two data points to U(0,1) I get
[0.99, 0.25]
How should I update my future predictions now?
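To spell out where the 1.73 comes from (my own reconstruction of the arithmetic; any small discrepancy is just rounding of the quoted U values): convert the two U(0,1) values back to z-scores and take the RMSE.

import numpy as np
from scipy import stats

u = np.array([0.99, 0.25])       # the two data points on the U(0,1) scale
z = stats.norm.ppf(u)            # back to z-scores: roughly [2.33, -0.67]
print(np.sqrt(np.mean(z ** 2)))  # ~1.7: the factor by which future intervals should be widened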
you are missing the step where I am transforming an arbitrary distribution to U(0, 1)
Medium confident in this explanation: because the square of random variables from the same distribution follows a gamma distribution, and it's easier to see violations from a gamma than from a uniform. If the majority of your predictions are from weird distributions then you are correct, but if they are mostly from normal or unimodal ones, then I am right. I agree that my solution is a hack that would make no statistician proud :)
Edit: Intuition pump, a T(0, 1, 100) obviou...
changed to "Making predictions is a good practice, writing them down is even better."
does anyone have a better way of introducing this post?
(Edit: the above post has 10 up votes, so many people feel like that, so I will change the intro)
You have two critiques:
Scott Alexander evokes tribalism
We predict more than people outside our group holding everything else constant
I was not aware of it, and I will change it if more than 40% agree
Remove reference to Scott Alexander from the intro: [poll]{Agree}{Disagree}
Also Remove "We rationalists...
This is a good point, but you need less data to check whether your squared errors are close to 1 than whether your inverse CDF look uniform, so if the majority of predictions are normal I think my approach is better.
The main advantage of SimonM/Metaculus is that it works for any continuous distribution.
you need less data to check whether your squared errors are close to 1 than whether your inverse CDF look uniform
I don't understand why you think that's true. To rephrase what you've written:
"You need less data to check whether samples are approximately N(0,1) than if they are approximately U(0,1)"
It seems especially strange when you think that transforming your U(0,1) samples to N(0,1) makes the problem soluble.
Agreed 100% on 1), and with 2) I think my point is "start using normal predictions as a gateway drug to overdispersed and model-based predictions"
I stole the idea from Gelman and simplified it for the general community; I am mostly trying to raise the sanity waterline by spreading the gospel of predicting on the scale of the observed data. All your critiques of normal forecasts are spot on.
Ideally everybody would use mixtures of over-dispersed distributions or models when making predictions to capture all sources of uncertainty
It is my hope that by ed...
You could make predictions from a t distribution to get fatter tails, but then the "easy math" for calibration becomes more scary... You can then take the "quantile" from the t distribution and ask what sigma in the normal it corresponds to. That is what I outlined/hinted at in "Advanced Techniques 3"
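Here is a minimal sketch of that transformation (my own reading of the trick, with made-up numbers): evaluate the observed outcome's quantile under the predicted t distribution, map that quantile to the equivalent standard-normal z-score, and use that z in the usual RMSE calculation.

import numpy as np
from scipy import stats

# a t prediction: location mu, scale s, degrees of freedom nu (made-up numbers)
mu, s, nu = 10.0, 2.0, 5
outcome = 14.5

quantile = stats.t.cdf((outcome - mu) / s, df=nu)  # quantile of the outcome under the t prediction
z = stats.norm.ppf(quantile)                       # the equivalent z-score in a normal
print(z)  # feed this z into the RMSE-of-z-scores calibration as if it came from a normal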
Good points. Everything is a conditional probability, so you can simply make conditional normal predictions:
Let A = Biden alive
Let B = Biden vote share
Then the normal probability is conditional on him being alive and does not count otherwise :)
Another solution is to make predictions from a t-distribution to get fatter tails, and then use "Advanced trick 3" to transform it back to a normal when calculating your calibration.
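A small sketch of how the conditional bookkeeping might look (my own illustration, made-up numbers): predictions whose condition did not hold are simply dropped before computing the calibration factor.

import numpy as np

# (condition_held, predicted mu, predicted sigma, observed outcome) - made-up numbers
records = [(True, 52.0, 2.0, 54.5), (False, 52.0, 2.0, None), (True, 30.0, 5.0, 26.0)]

z = np.array([(obs - mu) / sigma for held, mu, sigma, obs in records if held])
print(np.sqrt(np.mean(z ** 2)))  # calibration factor over the predictions that counted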
I think this was my parents, so they are forgiven :). Your story is pretty crazy, but there is so much to know as a doctor that most of it becomes rules of thumb (maps vs buttons) until called out like you did.
Fair point. I think my target audience is people like me who heard this saying about colorblindness (or other classical Mendelian diseases that run in families)
I have added a disclaimer towards the end :)
I am not sure I follow. I am confused about whether the 60/80 family refers to both parents, and what is meant by "off-beat" and "snap-back". I am also confused about what the numbers mean: is it 60/80 of the genes or 60/80 of the coding region (so only 40 genes)?
I totally agree. Technically it's a correct observation, but it's also what I was taught by adults when I asked as a kid, and therefore I wanted to correct it, as the real explanation is very short and concise.
That is hard to believe, you seem so smart at the UoB discord and your podcast :), thanks for sharing
The University of Bayes Discord (UoB) has study groups for Bayesian statistics which might be relevant to you. The newest study group is doing Statistical Rethinking 2022 as the lectures get posted to YouTube. It requires less math than you have demonstrated in your post.
If you want a slightly more rigorous path to Bayesian statistics, then I would advise reading Lambert or Gelman; see here for more info.
If you want to take the mathematician approach and learn probability theory first, then the book Probability 110 by Blitzstein is pretty good, the study group...
Totally agree, it's also Christian's critique of the idea :)... Maybe it could be relevant for aliens on a smaller planet, as they could leave their planet more easily and would thus be less advanced than us when we become space faring :)... Or a sci-fi where the different tech trees progress differently, like steam punk
Then maybe it only works for harem anime in space :)
A lot of your latex is not rendered correctly...
Agreed, but then you don't get cool space amazons :). It could be an extra fail safe mechanism :)
Good point. In principle the X chromosome already has this issue when you get it from your father. If the A chromosome is simply a normal X chromosome with an insertion of a set of proteins that block silencing, then you can still have recombination; if we assume the Amazon proteins are all located in the same LD region, then mechanically everything is as in the post, but we do not have the Muller's ratchet problem.
Also the A only recombines with X as AY is female and therefore never mates with an AX or AY
When the spaceship lands there is a 1% chance that no males are among the first 16 births (0.75^16 ≈ 1%)
Luckily males are fertile for longer, so if the second generation had no men the first generation still works.
If the A had a mutation such that AX did not have a 50% chance of passing on the A, then the gender ratio would be even more extreme. If the last man dies, an AY female could probably artificially inseminate a female.
You can update the matrix and redo the dot product to see how those different rules pan out; if you have a specific ratio you want to try then ...
This might help you https://github.com/MaksimIM/JaynesProbabilityTheory
But to be honest I did very few of the exercises. From chapter 4 onward most of the stuff Jaynes says is "over complicated" in the sense that he derives some fancy function, but that is actually just the Poisson likelihood or whatever, so as long as you can follow the math sufficiently to get a feel for what the text says, then you can enjoy that all of statistics is derivable from his axioms. But you don't have to be able to derive it yourself, and if you ever want to do actual Baye...