The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.
Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia's list of multiple discoveries.
To that end: what are some examples of discoveries which nobody else was anywhere close to figuring out?
A few tentative examples to kick things off:
- Shannon's information theory. The closest work I know of (notably Nyquist) was 20 years earlier, and had none of the core ideas of the theorems on fungibility of transmission. In the intervening 20 years, it seems nobody else got importantly closer to the core ideas of information theory.
- Einstein's special relativity. Poincaré and Lorentz had the math 20 years earlier IIRC, but nobody understood what the heck that math meant. Einstein brought the interpretation, and it seems nobody else got importantly closer to that interpretation in the intervening two decades.
- Penicillin. Gemini tells me that the antibiotic effects of mold had been noted 30 years earlier, but nobody investigated it as a medicine in all that time.
- Pasteur's work on the germ theory of disease. There had been both speculative theories and scattered empirical results as precedent decades earlier, but Pasteur was the first to bring together the microscope observations, theory, highly compelling empirical results, and successful applications. I don't know of anyone else who was close to putting all the pieces together, despite the obvious prerequisite technology (the microscope) having been available for two centuries by then.
(Feel free to debate any of these, as well as others' examples.)
Clarification: The 'derivation' for how the RLCT (real log canonical threshold) predicts generalization error IIRC goes through the same flavour of argument as the derivation of the vanilla Bayesian Information Criterion. I don't like this derivation very much. See e.g. this one on Wikipedia.
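For reference, here's roughly the shape of that standard argument (my own sketch of the usual Laplace-approximation/BIC derivation, with constants dropped; D is data with n samples, M a model with k free parameters, θ̂ the maximum-likelihood fit):

```latex
\begin{aligned}
p(M \mid D) \;&\propto\; p(M)\, p(D \mid M)
  \;=\; p(M) \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta, \\
\log p(D \mid M) \;&\approx\; \log p(D \mid \hat{\theta}, M) \;-\; \frac{k}{2}\log n \;+\; O(1).
\end{aligned}
```

The −(k/2) log n penalty is just the cost of having spread the prior p(θ|M) over a k-dimensional parameter space.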
So what it's actually showing is just this:

If our hypotheses correspond to different function fits (one for each parameter configuration, meaning we'd have 2^{32N} hypotheses if our function fits used N 32-bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as the number of parameters N goes to infinity.
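To put a number on that (my arithmetic, same uniform-prior setup as above):

```latex
% Uniform prior over all 2^{32N} distinct parameter configurations:
p(h_i) = 2^{-32N} \quad \text{for every single hypothesis } h_i,
% so even a small network with N = 10^6 parameters gives
p(h_i) = 2^{-3.2 \times 10^{7}} \approx 10^{-9.6 \times 10^{6}}.
```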
So the Wikipedia derivation of the vanilla model-selection posterior is telling us that having lots of parameters is bad, because it means we're spreading our prior across exponentially many hypotheses... if we have the sort of prior that says all the hypotheses are about equally likely.
But that's an insane prior to have! We only have 1.0 worth of probability to go around, and there are infinitely many different hypotheses. Which is why you're supposed to assign priors based on K-complexity, or at least something that doesn't go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we don't do that.
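For concreteness, the textbook version of that fix is a complexity-weighted (Solomonoff-style) prior, which stays normalisable no matter how many hypotheses you throw at it:

```latex
% Prior mass decays with description length / prefix Kolmogorov complexity K(h):
p(h) \;\propto\; 2^{-K(h)},
% and the total stays bounded, by the Kraft inequality:
\sum_{h} 2^{-K(h)} \;\le\; 1.
```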
In summary: badly normalised priors behave badly.
SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don't line up one-to-one with hypotheses.
It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.
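Stated a bit more precisely (this is just the standard SLT result, quoted loosely, not anything new): the prior volume of the near-optimal region of parameter space scales with the RLCT λ, and λ then takes over the role that k/2 plays in BIC:

```latex
% K(w): KL divergence from the truth to the model at parameters w; \varphi: prior over parameters.
% Volume of near-optimal parameter configurations, with RLCT \lambda and multiplicity m:
V(\varepsilon) = \int_{\{w \,:\, K(w) < \varepsilon\}} \varphi(w)\, dw
  \;\asymp\; \varepsilon^{\lambda}\, (-\log \varepsilon)^{\,m-1},
% Watanabe's free-energy asymptotics (compare the (k/2)\log n penalty of BIC):
F_n \;\approx\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m-1)\log\log n.
```

In the regular case λ = k/2 and m = 1, which recovers the BIC-style penalty; degeneracy (exponentially many parameter configurations expressing the same hypothesis) is what pushes λ below k/2.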
So our prior over hypotheses in that case is actually somewhat well-behaved, in that it can end up properly normalised when we take N→∞. That is a basic requirement a sane prior needs to have, so we're at least not completely shooting ourselves in the foot anymore. But that still doesn't show why this prior, which neural networks sort of[1] implicitly have, is actually good. Just that it's no longer obviously wrong in this specific way.
Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well?
I dunno. SLT doesn't say. It just tells us how the conversion from a prior over parameters to a prior over hypotheses works, and in the process shows us that neural network priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought, at least.
That's all though. It doesn't tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.
How to make this story tighter?
If people aim to make further headway on the question of why some function fits generalise somewhat and others don't, beyond 'well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn't actively bad', then I'd suggest a starting point might be to make a different derivation for the posterior on the fits, one that isn't trying to reason about p(M) defined as the probability that one of the function fits is 'true' in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a 150-billion-parameter transformer to internet data, we don't expect going in that any of these 2^{16×150×10^9} parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of M, which the SLT derivation of the posterior and most other derivations of this sort I've seen seem to implicitly make, we basically have p(M)≈0 going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like M = 'one of these models might get <1.1 average loss on holdout data sets'.
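To make the contrast explicit (my own formalisation of the suggestion above, not something taken from the SLT literature):

```latex
% The event the standard derivations implicitly condition on:
M_{\text{exact}} \;=\; \{\,\exists\, w : f_w \text{ predicts all future data exactly}\,\},
  \qquad p(M_{\text{exact}}) \approx 0.
% A coarser event we might actually assign nonzero prior credence to:
M_{\text{coarse}} \;=\; \{\,\exists\, w : \text{average holdout loss of } f_w < 1.1\,\},
  \qquad p(M_{\text{coarse}}) \gg 0.
```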
SLT in three sentences
'You thought your choice of prior was broken because it's not normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also, here's a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that part's not really finished'.
SLT in one sentence
'Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.'
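A toy numerical check of that one-sentence version (my own sketch, not from anything above: a 1D parameter grid with one wide, flat-bottomed zero-loss basin and one narrow quadratic one, uniform prior over the grid):

```python
import numpy as np

# Toy 1D "parameter space" with a uniform prior over a fine grid of parameter values.
w = np.linspace(-4.0, 4.0, 200_001)
prior = np.full_like(w, 1.0 / w.size)

# Two zero-loss basins: a wide, flat-bottomed one around w = +2 (big volume)
# and a narrow quadratic one around w = -2 (small volume).
wide_loss = np.maximum(np.abs(w - 2.0) - 0.5, 0.0) ** 2   # exactly zero for 1.5 < w < 2.5
narrow_loss = 50.0 * (w + 2.0) ** 2                        # zero only at w = -2
loss = np.minimum(wide_loss, narrow_loss)

for n in (10, 100, 1_000, 10_000):
    # Posterior over the grid, treating n * loss as a negative log-likelihood.
    log_post = -n * loss + np.log(prior)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    print(f"n={n:6d}  wide-basin mass={post[w > 0].sum():.3f}  "
          f"narrow-basin mass={post[w < 0].sum():.3f}")
```

The narrow basin's posterior mass shrinks like 1/√n while the flat basin keeps its full width's worth of prior, so the bigger-volume basin ends up with nearly all of the posterior.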
Sorta, kind of, arguably. There's some stuff left to work out here. For example, vanilla SLT doesn't even tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid having to compare equivalence over all possible inputs by checking which parameter settings give the same hidden representations over the training data, not just the same outputs.
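A minimal sketch of what that check could look like in practice (purely illustrative; the tiny MLP, shapes, and tolerance are all made up for the example):

```python
import numpy as np

def hidden_reps(params, X):
    """Hidden-layer activations of a tiny two-layer MLP (hypothetical architecture)."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1)

def outputs(params, X):
    """Network outputs for the same tiny MLP."""
    W1, b1, W2, b2 = params
    return hidden_reps(params, X) @ W2 + b2

def same_hypothesis(params_a, params_b, X_train, tol=1e-6):
    """Heuristic grouping: count two parameter settings as the same hypothesis only if
    they agree on hidden representations over the training data, not merely on outputs."""
    return (np.allclose(outputs(params_a, X_train), outputs(params_b, X_train), atol=tol)
            and np.allclose(hidden_reps(params_a, X_train), hidden_reps(params_b, X_train), atol=tol))

# Example with random parameters for a 3-16-1 MLP; independent draws almost surely differ.
rng = np.random.default_rng(0)
def random_params():
    return (rng.normal(size=(3, 16)), rng.normal(size=16),
            rng.normal(size=(16, 1)), rng.normal(size=1))
X_train = rng.normal(size=(100, 3))
print(same_hypothesis(random_params(), random_params(), X_train))  # -> False
```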