The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.
Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia's list of multiple discoveries.
To that end: what are some examples of discoveries which nobody else was anywhere close to figuring out?
A few tentative examples to kick things off:
- Shannon's information theory. The closest work I know of (notably Nyquist) was 20 years earlier, and had none of the core ideas of the theorems on fungibility of transmission. In the intervening 20 years, it seems nobody else got importantly closer to the core ideas of information theory.
- Einstein's special relativity. Poincaré and Lorentz had the math 20 years earlier IIRC, but nobody understood what the heck that math meant. Einstein brought the interpretation, and it seems nobody else got importantly closer to that interpretation in the intervening two decades.
- Penicillin. Gemini tells me that the antibiotic effects of mold had been noted 30 years earlier, but nobody investigated it as a medicine in all that time.
- Pasteur's work on the germ theory of disease. There had been both speculative theories and scattered empirical results as precedent decades earlier, but Pasteur was the first to bring together the microscope observations, theory, highly compelling empirical results, and successful applications. I don't know of anyone else who was close to putting all the pieces together, despite the obvious prerequisite technology (the microscope) having been available for two centuries by then.
(Feel free to debate any of these, as well as others' examples.)
Clarification: The 'derivation' for how the RLCT (real log canonical threshold) predicts generalization error IIRC goes through the same flavour of argument as the derivation of the vanilla Bayesian Information Criterion. I don't like this derivation very much. See e.g. this one on Wikipedia.
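For reference, here's roughly the shape of that standard argument (my own sketch of the usual Laplace-approximation/BIC derivation, with constants dropped; D is data with n samples, M a model with k free parameters, θ̂ the maximum-likelihood fit):

```latex
\begin{aligned}
p(M \mid D) \;&\propto\; p(M)\, p(D \mid M)
  \;=\; p(M) \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta, \\
\log p(D \mid M) \;&\approx\; \log p(D \mid \hat{\theta}, M) \;-\; \frac{k}{2}\log n \;+\; O(1).
\end{aligned}
```

The −(k/2) log n penalty is just the cost of having spread the prior p(θ|M) over a k-dimensional parameter space.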
So what it's actually showing is just this:

If our hypotheses correspond to different function fits (one for each parameter configuration, meaning we'd have 2^{32N} hypotheses if our function fits used N 32-bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as the number of parameters N goes to infinity.
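To put a number on that (my arithmetic, same uniform-prior setup as above):

```latex
% Uniform prior over all 2^{32N} distinct parameter configurations:
p(h_i) = 2^{-32N} \quad \text{for every single hypothesis } h_i,
% so even a small network with N = 10^6 parameters gives
p(h_i) = 2^{-3.2 \times 10^{7}} \approx 10^{-9.6 \times 10^{6}}.
```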
So the Wikipedia derivation of the vanilla model-selection posterior is telling us that having lots of parameters is bad, because it means we're spreading our prior across exponentially many hypotheses... if we have the sort of prior that says all the hypotheses are about equally likely.
But that's an insane prior to have! We only have 1.0 worth of probability to go around, and there are infinitely many different hypotheses. Which is why you're supposed to assign priors based on K-complexity, or at least something that doesn't go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we don't do that.
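For concreteness, the textbook version of that fix is a complexity-weighted (Solomonoff-style) prior, which stays normalisable no matter how many hypotheses you throw at it:

```latex
% Prior mass decays with description length / prefix Kolmogorov complexity K(h):
p(h) \;\propto\; 2^{-K(h)},
% and the total stays bounded, by the Kraft inequality:
\sum_{h} 2^{-K(h)} \;\le\; 1.
```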
In summary: badly normalised priors behave badly.
SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don't line up one-to-one with hypotheses.
It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.
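Stated a bit more precisely (this is just the standard SLT result, quoted loosely, not anything new): the prior volume of the near-optimal region of parameter space scales with the RLCT λ, and λ then takes over the role that k/2 plays in BIC:

```latex
% K(w): KL divergence from the truth to the model at parameters w; \varphi: prior over parameters.
% Volume of near-optimal parameter configurations, with RLCT \lambda and multiplicity m:
V(\varepsilon) = \int_{\{w \,:\, K(w) < \varepsilon\}} \varphi(w)\, dw
  \;\asymp\; \varepsilon^{\lambda}\, (-\log \varepsilon)^{\,m-1},
% Watanabe's free-energy asymptotics (compare the (k/2)\log n penalty of BIC):
F_n \;\approx\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m-1)\log\log n.
```

In the regular case λ = k/2 and m = 1, which recovers the BIC-style penalty; degeneracy (exponentially many parameter configurations expressing the same hypothesis) is what pushes λ below k/2.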
So our prior over hypotheses in that case is actually somewhat well-behaved, in that it can end up properly normalised when we take N→∞. That is a basic requirement a sane prior needs to have, so we're at least not completely shooting ourselves in the foot anymore. But that still doesn't show why this prior, which neural networks sort of[1] implicitly have, is actually good. Just that it's no longer obviously wrong in this specific way.
Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well?
I dunno. SLT doesn't say. It just tells us how the conversion from a prior over parameters to a prior over hypotheses works, and in the process shows us that neural network priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought, at least.
That's all though. It doesn't tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.
How to make this story tighter?
If people aim to make further headway on the question of why some function fits generalise somewhat and others don't, beyond 'well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn't actively bad', then I'd suggest a starting point might be to make a different derivation for the posterior on the fits, one that isn't trying to reason about p(M) defined as the probability that one of the function fits is 'true' in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a 150-billion-parameter transformer to internet data, we don't expect going in that any of these 2^{16×150×10^9} parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of M, which the SLT derivation of the posterior and most other derivations of this sort I've seen seem to implicitly make, we basically have p(M)≈0 going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like M = 'one of these models might get <1.1 average loss on holdout data sets'.
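To make the contrast explicit (my own formalisation of the suggestion above, not something taken from the SLT literature):

```latex
% The event the standard derivations implicitly condition on:
M_{\text{exact}} \;=\; \{\,\exists\, w : f_w \text{ predicts all future data exactly}\,\},
  \qquad p(M_{\text{exact}}) \approx 0.
% A coarser event we might actually assign nonzero prior credence to:
M_{\text{coarse}} \;=\; \{\,\exists\, w : \text{average holdout loss of } f_w < 1.1\,\},
  \qquad p(M_{\text{coarse}}) \gg 0.
```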
SLT in three sentences
'You thought your choice of prior was broken because it's not normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also, here's a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that part's not really finished'.
SLT in one sentence
'Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.'
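A toy numerical check of that one-sentence version (my own sketch, not from anything above: a 1D parameter grid with one wide, flat-bottomed zero-loss basin and one narrow quadratic one, uniform prior over the grid):

```python
import numpy as np

# Toy 1D "parameter space" with a uniform prior over a fine grid of parameter values.
w = np.linspace(-4.0, 4.0, 200_001)
prior = np.full_like(w, 1.0 / w.size)

# Two zero-loss basins: a wide, flat-bottomed one around w = +2 (big volume)
# and a narrow quadratic one around w = -2 (small volume).
wide_loss = np.maximum(np.abs(w - 2.0) - 0.5, 0.0) ** 2   # exactly zero for 1.5 < w < 2.5
narrow_loss = 50.0 * (w + 2.0) ** 2                        # zero only at w = -2
loss = np.minimum(wide_loss, narrow_loss)

for n in (10, 100, 1_000, 10_000):
    # Posterior over the grid, treating n * loss as a negative log-likelihood.
    log_post = -n * loss + np.log(prior)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    print(f"n={n:6d}  wide-basin mass={post[w > 0].sum():.3f}  "
          f"narrow-basin mass={post[w < 0].sum():.3f}")
```

The narrow basin's posterior mass shrinks like 1/√n while the flat basin keeps its full width's worth of prior, so the bigger-volume basin ends up with nearly all of the posterior.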
Sorta, kind of, arguably. There's some stuff left to work out here. For example, vanilla SLT doesn't even tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid having to compare equivalence over all possible inputs by checking which parameter settings give the same hidden representations over the training data, not just the same outputs.
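A minimal sketch of what that check could look like in practice (purely illustrative; the tiny MLP, shapes, and tolerance are all made up for the example):

```python
import numpy as np

def hidden_reps(params, X):
    """Hidden-layer activations of a tiny two-layer MLP (hypothetical architecture)."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1)

def outputs(params, X):
    """Network outputs for the same tiny MLP."""
    W1, b1, W2, b2 = params
    return hidden_reps(params, X) @ W2 + b2

def same_hypothesis(params_a, params_b, X_train, tol=1e-6):
    """Heuristic grouping: count two parameter settings as the same hypothesis only if
    they agree on hidden representations over the training data, not merely on outputs."""
    return (np.allclose(outputs(params_a, X_train), outputs(params_b, X_train), atol=tol)
            and np.allclose(hidden_reps(params_a, X_train), hidden_reps(params_b, X_train), atol=tol))

# Example with random parameters for a 3-16-1 MLP; independent draws almost surely differ.
rng = np.random.default_rng(0)
def random_params():
    return (rng.normal(size=(3, 16)), rng.normal(size=16),
            rng.normal(size=(16, 1)), rng.normal(size=1))
X_train = rng.normal(size=(100, 3))
print(same_hypothesis(random_params(), random_params(), X_train))  # -> False
```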