Strong Evidence is Common
Portions of this are taken directly from Three Things I've Learned About Bayes' Rule.

One time, someone asked me what my name was. I said, “Mark Xu.” Afterward, they probably believed my name was “Mark Xu.” I’m guessing they would have happily accepted a bet at 20:1 odds that my driver’s license would say “Mark Xu” on it.

The prior odds that someone’s name is “Mark Xu” are generously 1:1,000,000. Posterior odds of 20:1 imply that the odds ratio of me saying “Mark Xu” is 20,000,000:1, or roughly 24 bits of evidence. That’s a lot of evidence.

Seeing a Wikipedia page say “X is the capital of Y” is tremendous evidence that X is the capital of Y. Someone telling you “I can juggle” is massive evidence that they can juggle. Putting an expression into Mathematica and getting Z is enormous evidence that the expression evaluates to Z. Vast odds ratios lurk behind many encounters.

One implication of the Efficient Market Hypothesis (EMH) is that it is difficult to make money on the stock market. Generously, maybe only the top 1% of traders will be profitable. How difficult is it to get into the top 1% of traders? To be 50% sure you're in the top 1%, you only need 200:1 evidence. This seemingly large odds ratio might be easy to get.

On average, people are overconfident, but 12% aren't. It only takes 50:1 evidence to conclude you are much less overconfident than average. An hour or so of calibration training and the resulting calibration plots might be enough. Running through Bayes’ Rule explicitly might produce a bias towards middling values.

Extraordinary claims require extraordinary evidence, but extraordinary evidence might be more common than you think.
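The arithmetic above is just Bayes' rule in odds form: posterior odds are prior odds times the likelihood ratio, and bits of evidence are the base-2 log of that ratio. Here is a minimal sketch in Python; the numbers are the ones quoted in the post, except the 1:99 prior for the trading example, which is an assumption read off from "top 1%."

```python
import math

def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

def bits_of_evidence(likelihood_ratio: float) -> float:
    """Evidence in bits is the base-2 log of the likelihood ratio."""
    return math.log2(likelihood_ratio)

# Name example from the post: prior odds 1:1,000,000, posterior odds 20:1.
prior = 1 / 1_000_000
posterior = 20 / 1
likelihood_ratio = posterior / prior                      # 20,000,000:1
print(f"{bits_of_evidence(likelihood_ratio):.1f} bits")   # ~24.3 bits

# Trading example, assuming a 1:99 prior of being in the top 1%:
# 200:1 evidence pushes the posterior odds to about 2:1 (~67%), past 50%.
odds = posterior_odds(1 / 99, 200)
print(f"posterior odds {odds:.2f}:1 -> p = {odds / (1 + odds):.2f}")
```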
I will try to express more directly the positive intuition for why all of this seems possible to me, that is, why I think such a loss function over heuristic arguments, one that makes all the correct tradeoffs, should exist.
Consider the process of SGD as a process of Bayesian model selection. We start with some prior over the possible weights of a model in some GPT architecture, then we update based on a series of data, and in the end we get some model. We might then similarly have a bunch of objections to how such a model selection process could ever learn the data, e.g. that we don't have enough parameters to memorize...
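To make the Bayesian-model-selection framing concrete, here is a toy sketch: a finite set of candidate models, a prior over them, and a posterior update on each observed data point. This is purely illustrative and says nothing about how SGD on a GPT actually behaves; the hypothesis class and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate "models": biased-coin hypotheses with different head probabilities.
models = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
prior = np.full(len(models), 1 / len(models))   # uniform prior over models

# Observed data: coin flips actually drawn from a 0.7 coin.
data = rng.random(50) < 0.7

posterior = prior.copy()
for flip in data:
    # Likelihood of this observation under each candidate model.
    likelihood = np.where(flip, models, 1 - models)
    posterior = posterior * likelihood
    posterior /= posterior.sum()                # renormalize after each update

# The posterior concentrates on the model closest to the data-generating process.
print(dict(zip(models.tolist(), posterior.round(3).tolist())))
```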