Alternative title: A simple framework for thinking about applying machine learning to science and engineering in a way that both produces reliable results and poses comparatively ~0 risk of destroying humanity

Mathematics was designed for the human brain.

In our times, this seems like a rather silly design decision. Applied mathematics, be it for physics, chemistry, or economics, runs on computers.

To the extent that applied mathematics has adopted computers, it has been to speed up the solving of problems relevant principally to a pre-computer world. As if our ancestors had discovered the combustion engine, then used it to boost the speed of a horse-drawn carriage. To a limited degree this might work, but the horse becomes a pointless constraint in the design once the engine is thousands of times more powerful than the horse.

Motivated reasoning could keep justifying the horse’s existence. Oh, it helps in edge cases where you run out of fuel, it’s easier for humans to “understand” the carriage if it has a horse, and the constraints provided by the horse are ultimately useful in coming up with better designs.

But we don’t do that; we build cars.

The engine of mathematics now exists, and we ought to think really hard about how to develop mathematics around it, rather than retrofit computers onto mathematics designed for the CNS of apes.

One amendment I should make is that I am not claiming our imagination or thinking is so trivial as to be easily replaced by computers as they are now, even the most powerful and adeptly programmed. I argue that the constraint of human minds that have to add, subtract, and memorize numbers is limiting our mathematics.

1. Leaky Abstractions

Take something like sigma summation (∑): at first glance it looks like an abstraction, but it’s too leaky to be a good one.

Leaky is a term from software engineering; it refers to a fault of an abstraction, or ‘interface’, that helps us interact with a system. An interface is ‘leaky’ if it does not properly cover the system, sometimes forcing us to reach into parts of the system directly.

Why is it leaky? Because all the concepts it “abstracts” over need to be understood in order to use the sigma summation. You need to "understand" operations, operators, numbers, ranges, limits, and series.

Or take integrals (∫): to master the abstraction one needs to understand all of calculus and all that is foundational to calculus. Knowing how to use an integral in 2 dimensions given a certain category of functions is entirely different from learning to use it in 3 dimensions, or in 2 dimensions with a different category of functions... and both the dimensions and the categories of functions can be literally infinite, so that's a bit of an issue.

Leaky abstractions are the norm in mathematics, which is not surprising, since it’s almost impossible to write proper abstractions without computers to execute the “complex bits”. The only popular counterexample is the integral transform (think Laplace and Fourier transforms), which first became useful with the advent of computers, and incidentally appeared concurrently with primitive ideas about mechanical computers capable of loops [MP81].

Indeed, any mathematician in the audience might scoff at me for using the word "abstraction" for what they think of as shorthand notations. Until recently it was unthinkable that one would or could use mathematical notation as a black box, without understanding every single step. But mathematical black boxes have now become better writers and artists than 99.x% of people, without anyone quite understanding how that’s happening, so we’d better get used to it quickly.


It’s worth mentioning that there are no absolutely wrong or right facts in science and engineering, which are the core use cases for mathematics. These are broad paradigms we use to manipulate and understand our world, and theories live and die on their practical results and aesthetic appeal. This works because they are ultimately grounded in observations about the world, and while theories live and die, observations endure.

Darwin’s observations about finches were not by any means “wrong”. Heck, most of what Jabir ibn Hayyan found out about the world is probably still correct. What is incorrect or insufficient are the theoretical frameworks they used to explain the observations.

Alchemy might have described the chemical properties and interactions of certain types of matter quite well. We've replaced it with the Bohr model of chemistry simply because it abstracts away more observations than alchemy.

Using abstractions like sexual competition, DNA, proteins, crossing-over, and mutations fits Darwin’s observations better than his original “survival of the fittest” evolutionary framework. That’s not to say his theory was wrong; it was a tool that outlived its value.


But unlike science and engineering, mathematics in itself has no “observations” at its foundation. It’s a system of conceptual reasoning and representation, much like language. The fact that 2 * 3 = 6 is not observed; it’s simply “known”. Or… is it?

The statement “5210644015679228794060694325390955853335898483908056458352183851018372555735221 is a prime number” is just as true as the statement “5 is a prime number”.

A well-educated ancient Greek or Alexandrian could be certain that 5 is prime, at both a very intuitive and fundamental level, the same way he could know the meaning of words or the sentences they string together, but could never in a lifetime confirm or deny that “5210644015679228794060694325390955853335898483908056458352183851018372555735221” is prime.

The 100th digit of pi doesn’t come “intuitively” until you have a computer, nor does the approximate result of a fundamentally unsolvable integral when parametrized. The bar for “provable” is higher: a very dedicated ancient mathematician could get the 100th digit of pi, but the 10^12th digit is essentially unprovable using human brains alone; it exceeds mortal lifespans. You need computers to prove this, much the same way you need electron microscopes or the LHC to observe matter at certain scales.

Computers raise the bar for "intuitive” very high. So high that “this is obviously false” brute force approaches can be used to disprove sophisticated conjectures. As long as we are willing to trust the implementation of the software and hardware, we can quickly validate any mathematical tool over a very large finite domain. [PF66]
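To make this concrete, here is a toy sketch in Python of what "validating (or disproving) over a finite domain" looks like. The conjecture being tested (the classic false claim that n^2 + n + 41 is prime for every non-negative integer n) and the search bound are illustrative choices of mine, not the result behind [PF66].

```python
def is_prime(n):
    """Trial division; fine for a toy search."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

# Conjecture (false): n^2 + n + 41 is prime for every non-negative integer n.
for n in range(1000):
    value = n * n + n + 41
    if not is_prime(value):
        print(f"Counterexample: n = {n} gives {value}, which is not prime")
        break
else:
    print("No counterexample found for n < 1000")
```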

In many cases, although mathematicians are loath to admit it, the difference between “provable” and “intuitive” is just a gradient, with “provable” meaning “intuitive if you put in a gigantic amount of time”. Because we lacked computers, most “proved” ideas until recently were specifically those that broke the pattern and could be figured out by brute force. [1]

2. Neural Networks - A Computer-Era Mathematical Abstraction

Neural networks are a great example of a mathematical abstraction that's very powerful, minimally leaky, and designed for running on a computer: nobody is "solving" neural network training or inference with pen and paper.

Take the specific problem of dimensionality reduction (DR), something for which hundreds of (bad) human-brain-first mathematical approaches exist. These approaches vary in the number of dimensions they can handle, the sacrifices in precision they make for the sake of speed, and the underlying assumptions they make about the data. Nor is it always trivial to compare two such methods.

Given the neural network abstraction, this problem is dissolved by an autoencoder (AE) [DL16]. I would roughly describe an AE as:

A network that gets the data as an input and has to predict the same data as an output. Somewhere in that network, preferably close to the output layer, add a layer E of size n, where n is the number of dimensions you want to reduce the data to. Train the network. Run your data through the network and extract the output from layer E as the reduced dimension representation of said data. Since we know from training the network that the values generated by E are good enough to reconstruct the input values, they must be a good representation.

This is a very simple explanation compared to that of most “classic” nonlinear DR algorithms. All the computation is safely abstracted away behind the word "train".
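To make the “train” step concrete, here is a minimal sketch of such an AE, assuming PyTorch. The layer sizes, the choice of two reduced dimensions, and the training loop are illustrative, and the bottleneck here sits in the middle of the network, which is the more conventional placement.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int, reduced_dim: int = 2):
        super().__init__()
        # Encoder: compress the input down to the bottleneck layer E.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, reduced_dim),  # this is layer E
        )
        # Decoder: reconstruct the input from E.
        self.decoder = nn.Sequential(
            nn.Linear(reduced_dim, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reduce_dimensions(data: torch.Tensor, reduced_dim: int = 2, epochs: int = 200):
    """Train the AE to reconstruct `data`, then return the bottleneck activations."""
    model = Autoencoder(data.shape[1], reduced_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(data), data)  # the network predicts its own input
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return model.encoder(data)  # the output of layer E is the reduced representation

# Usage: reduced = reduce_dimensions(torch.randn(500, 30), reduced_dim=2)
```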

As an example of their generality, a subset of AEs basically converges to performing PCA when that best fits the problem. [PA19] & [CV19] (these make more sense when read together, in that order)

Most "real" AEs end up using nonlinear transformations, and more broadly, transformations that are more complex in terms of parameters required than “manual” DR methods.

AEs are fairly “generic” compared to classical DR algorithms, and there are a lot of those algorithms. So one can strive to understand when and why to apply the various DR algorithms, or one can strive to understand AEs. The end result will be similar in efficacy when applied to various problems, but understanding AEs is much quicker than understanding a few dozen other DR algorithms.

Even better, with a simple rule of thumb that says “for DR, use AEs”, one doesn’t even need to understand AEs; they can just be applied using some library. This is “unideal”, but less so than the current approach, where the vast majority of scientists have at one time or another used PCA without understanding it, in cases where it’s a suboptimal DR method.


Neural networks can theoretically replace a lot of the mathematical models that are foundational to science. How much is hard to tell. A lot of scientific theory now exists in lieu of the data that generated it, data which has been destroyed or is extremely hard to find. Besides, scientists still see themselves as creators and solvers of ape math, which makes them unlikely to turn to machine learning methods.

But in the coming years, if machine learning can indeed outperform human-brain-run math in terms of discovery speed and accuracy, as it has been showing it can in the last few years, I expect a lot of our new scientific theories to revolve around datasets and machine learning models, rather than around complex chains of symbols loosely aiming to describe reality while having the primary goal of being easy to compute.

I realize this is a fairly bold claim, and the rest of this article will try to provide intuitions for why it might be true.


Neural networks can be a simple-to-understand abstraction, in that they allow for the creation of complex models from a relatively simple set of operations, the kind a smart 12th-grader understands.

Neural networks are non-leaky: they can be applied with success without understanding the underlying abstractions. [2]

In some cases, neural networks will end up approximating existing mathematical methods without all the necessary conceptual baggage.

We ought to pick our mathematical methods for practical reasons, rather than based on some sort of divine or platonic laws, and current mathematical methods are overly complex and yield little.

3. Cross-Validation Is Better Than Assuming Magical Shapes

Mathematics has many use cases, ranging from purely recreational to very practical. But, by far, almost all of its practical uses come from modeling the world in problems relating to science and engineering.

Given some observations, it’s rather easy to come up with a mathematical equation that fits them; at worst, a mere hash table mapping inputs to outputs (adding indeterminacy when overlaps occur) will model the problem perfectly. The reasons we don’t build scientific theories or engineering models with input → output maps are twofold:

(1) Out-of-sample generalization

(2) Metaphysical beliefs about how the world should work

I realize that (2) might come as a bit of a surprise, but our beliefs about metaphysics are what drive the shape of scientific theory. At the atomic or subatomic level, there are no particles or waves; there are strange things our minds cannot comprehend, which we can model with equations and probability distributions. The fact that we choose to think of them as “particles” and “waves” is pure metaphysics; it aligns with our intuitions about how the world ought to work.

More broadly, our metaphysics tends to favor things like arcs and straight lines. One argument for using regressions (the hyperplane that best fits our data) is that it’s “the simplest possible model”. This statement is purely subjective, and even from our “walled-in” perspective it’s easy to see that, for example, a certain number of branches (think if/else statements, or binary decision trees) could be considered simpler than a set of (number of dimensions - 1) parameters, each multiplied by an input dimension and then summed up.
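As a small illustration of how subjective “simplest” is, here is a sketch (assuming scikit-learn and synthetic data of my choosing) that fits both kinds of model to the same toy problem; whether a few coefficients or a few if/else branches counts as simpler is a matter of taste, not mathematics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.1 * rng.normal(size=200)

# The "simple" linear model: one coefficient per input dimension, plus an intercept.
linear = LinearRegression().fit(X, y)
print("coefficients:", linear.coef_, "intercept:", linear.intercept_)

# The "simple" tree: a handful of if/else branches.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x0", "x1", "x2"]))
```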

This is a fairly arbitrary choice. Arcs and straight lines don’t even exist; indeed, this is one of the few things we do know about nature: it fundamentally cannot have matter shaped in a way that fits our definition of a line or a circle, as there is too much movement and not enough certainty at a “low enough” level for that. Other intuitively simple mathematical constructs, such as probabilistic decision trees, can actually model reality “all the way down” (if inefficiently).

So why choose straight lines? Why choose arcs? Why choose to define things like normal or t-distributions?

The answer might be, in part, something fundamental about how our brains work (which doesn’t make it “right”) and, in larger part, a matter of efficiency in the pre-computer era. The number of bits an equation takes to store and the number of FLOPs it takes to execute have only a very loose relation to how easily the human mind can represent and solve it.


Fine, but then we run into out-of-sample generalization. And one must admit that letting a neural network or some other universal approximator loose at a dataset often leads to rather nonsensical models. There is a tendency for “the universe” to favor simpler models.

Is this solvable? Yes, and in the process of solving it you also get rid of a lot of “dumb” concepts, such as p-values and confidence ranges (the kind chosen arbitrarily based on unfounded assumptions about normality).

The simplest solution is leave-one-out cross-validation: we simply train our model on all but one datapoint, then validate its accuracy on that one datapoint, and rinse and repeat for every single datapoint.
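A minimal sketch of that procedure, assuming scikit-learn, hypothetical numpy arrays X and y, and ridge regression as a stand-in for whatever model is being evaluated:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

def leave_one_out_errors(X, y, make_model=Ridge):
    """Train on all but one datapoint, predict the held-out one, repeat for every datapoint."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        prediction = model.predict(X[test_idx])[0]
        errors.append(abs(prediction - y[test_idx][0]))
    return np.array(errors)

# errors = leave_one_out_errors(X, y)
# print("median out-of-sample error:", np.median(errors))
```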

Of course, this is impractical, and in practice, you might want to make it more efficient and robust by doing things like:

Training the model on a (large) percentage of the data and then validating on a (small) percentage of the data (rinse and repeat, always making sure the validation datasets are entirely different)

Training the model on temporally older data and validating it on newer data.

Training the model and then testing it on “new” datapoints outside of the dataset boundaries, then “simplifying” it if the results vary too much (or running an experiment to validate the claims if you’ve done all the simplification you can)

Training the model on more “concentrated” volumes of data and validating it on data that are “further away” (based on some distance function for our datapoints)

These methods can be combined, and the list is not exhaustive. By using the method of cross-validation we can actually make finer-grained statements about the inherent uncertainty of our model.

There are also unrelated methods which can yield a lot of information about a model’s input response that can help here [IM22], but I won’t go into those for the sake of brevity; suffice it to say they cover a lot of the edge cases that cross-validation-related approaches miss.

A p-value becomes irrelevant: as long as a model performs well under cross-validation, we can reasonably assume it will “hold” in the real world, unless our examples were biased (a case in which p-values can’t help either).

Confidence ranges can be specified in a more fine-grained way, where, depending on the values of a specific datapoint, we can assume different confidence ranges. The same goes for accuracy.

Models can be made to generalize better via techniques like L1 and L2 regularization, pruning, and running hyperparameter searches for models with low numbers of parameters that still reach an optimal solution.
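As a sketch of one of these techniques: L1 regularization with the penalty strength chosen by cross-validation, which pushes unneeded coefficients to exactly zero (a crude form of pruning). Assumes scikit-learn and hypothetical arrays X and y.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_sparse_model(X, y):
    model = LassoCV(cv=5).fit(X, y)  # the L1 strength is picked by 5-fold cross-validation
    kept = np.count_nonzero(model.coef_)
    print(f"kept {kept} of {X.shape[1]} features; regularization strength alpha = {model.alpha_:.4f}")
    return model
```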

Even better, unlike the typical approach, we don’t need any “the world is composed of straight lines” magical thinking in order to assume a model will generalize; we can see it generalize on real data, and we might find new subsets of problems where “the shortest equation explaining our data” does not equate to “generalizes better”.

This idea is not novel, insofar as people came to realize it’s possible and superior as far back as the 70s [CV74]. To my knowledge, the reason it wasn’t applied in the “hard sciences” is that, until recently, it was impractical. Nowadays, it’s being used in a variety of fields (e.g. fluid dynamics models [AS20]). I leave it to the reader to speculate why it’s not applied in fields like psychology and economics, where the less-cumulative nature of evidence, simpler theoretical models, and more naive mathematical apparatus in use would make it ideal. For what it’s worth, I’ve personally tried doing so and the results are easier to understand and in agreement with a typical statistical approach [TD20].

4. Putting It Together - Universal Modelers

The question I want to ask is something like this:

Could we build new science and engineering, or even rebuild existing models, starting from computer-centric abstractions for modeling data, like those investigated in machine learning?

I think this is possible, not in all cases, but in a significant enough number of cases. Some scientists and engineers are already familiar with the term universal approximator, which they could use to describe things like neural networks. However, what we need is a broader concept, that of a universal modeler.

A conceptual tool that allows us, given any sort of numerically & relationally quantifiable data, to find one or more models that fit our accuracy and out-of-sample generalizability requirements. These models needn’t “make sense” at the level of the equations being executed. If current ML models are a good indicator, they will be arcane to the human mind.

We only need to understand the methods by which we train them (in the current paradigm those would mostly be: the hyperparameter search, the optimization algorithm, the loss function, any regularization we apply, as well as distillation, pruning and other methods to reduce size & complexity), and their input-response curves, which can be investigated using cross-validation, adjacent techniques and, potentially, more specific model explainability techniques.

More importantly, this “understanding” can be left to the builders of such universal modelers. To a first approximation, most scientists could simply use them by understanding the abstraction itself: since it’s not leaky, you don’t need to solve problems under the cover for it to be useful; it just works, and a computer runs it for you.
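To give a flavor of what such an interface might look like in practice, here is a minimal sketch assuming scikit-learn. The candidate model families, the scoring rule, and the accuracy requirement are illustrative choices of mine; real tools in this space [ES22] are far more elaborate.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

CANDIDATES = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=200, random_state=0),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
]

def universal_model(X, y, required_r2=0.8, cv=5):
    """Return a fitted model whose cross-validated R^2 meets the requirement,
    together with every candidate's score so the user sees the uncertainty."""
    scores = {type(m).__name__: cross_val_score(m, X, y, cv=cv).mean() for m in CANDIDATES}
    best_name = max(scores, key=scores.get)
    if scores[best_name] < required_r2:
        raise ValueError(f"No candidate met the accuracy requirement: {scores}")
    best = next(m for m in CANDIDATES if type(m).__name__ == best_name)
    return best.fit(X, y), scores
```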


So why aren’t such universal modelers used?

Well, they are, with great success. The first, rather naive iteration of AlphaFold was a paradigm shift for protein folding, one of the hard problems of molecular biology, setting both a record in accuracy and a record in how far ahead it was of any contemporary [AC19]. The improvements to AlphaFold allowed it to essentially “solve” protein folding [HA21] (insofar as we define the problem as predicting the shapes of crystallized proteins).

As mentioned before, there’s a lot of work in fluid dynamics using machine learning [RD21], including work that specifically uses a paradigm like the one outlined above [AS20].

A beautiful if primitive example of this was also illustrated by the first DNA methylation aging “clock” [SD13], created using a regularized regression and quantified by its cross-validation performance. To the best of my interpretation, more recent “epigenetic clocks” perform better in part because they use better universal modelers [EC20].

There are hundreds of other examples; for any hard-to-solve important energy equation, some physicist or engineer has probably tried, even if suboptimally, to at least use universal approximators, if not universal modelers.

Most of this isn’t accomplished by universal modelers alone; they are usually used on top of theory-laden hand-made models, and they rely on theory-laden approaches to encoding data. We can’t deny that certain problems lend themselves very elegantly to simple solutions using arcs, lines, and normal distributions, and abandoning those in such cases would seem premature. But we should more readily try to replace large leaky equations with solutions that rely on universal modelers alone.

Tooling to make this easier is being built, with dozens of libraries in existence, many of them partially or fully open source and free to use [ES22].

One of the main problems we face at the moment is that we’re not moving fast enough in this direction. Our aesthetic sensibilities make us treat universal modelers as a sort of last resort, to be used only when the results of “beautiful” mathematics are utterly ruinous. I hope the first half of this article makes it clear why this is misguided: universal modelers are superior abstractions that should be prioritized and used as our first line of attack on any problem.

The other problem, mainly present in sciences outside the realm of physics, chemistry, and molecular biology, is one of refusal to adopt. Again, I will leave the potential reasons here up in the air, but, while hard sciences have a strong commitment to falsification, meaning the better theory usually wins out, this is not the case in other areas, where inferior models have long been used, in spite of us knowing better. To some extent, the problem here is one of bringing the issue to public awareness [SH21].

How much simpler would science and engineering become if we could recreate them using universal modelers? Limiting theories to the places where they are necessary, mainly for creating the ontologies which allow us to measure and act. Limiting leaky mathematical models to the places where they prove better than universal modelers when evaluated with the same levels of rigor.

Imagine, for example, if we could understand transistors and analog circuits as pure black boxes: you have your predictive model, you feed it your input, and you get the response. In practice, we already model such circuits using what amounts to black-box algorithms. But we insist on adding a make-believe layer of understanding on top, mathematical and theoretical baggage that breaks down when it encounters reality and has to be salvaged by a lot of wasted compute.

Universal modelers have the potential to simplify science and engineering to the point where true polymaths might once again exist, not only making our world models more accurate and easier to generate but allowing a single individual to fully grasp and use them.

5. An Example

Let's look at a practical example, the first epigenetic clock [SD13]. Horvath chooses an elastic net as his predictive model. This model is trained on (which is to say "made to fit") several datasets containing the methylation status of a few hundred CpGs and the age of the subject. The model is trained/fit on all but one dataset, which he leaves out for validation; this is the dataset on which we determine how well the model generalizes out of sample, i.e. its accuracy. We can repeat this process, training a regression on all-but-one datasets for each dataset and looking at the regression's predictions for the remaining dataset.

The key question we want to answer here is whether methylation of said CpGs can track age. But this implies we can compute the distance between ages, and that we have an error function that allows us to say how "well" age was predicted.

In this case, Horvath chooses to use the median error, as well as a Pearson correlation coefficient. In order to know how "right" our models/theories are, we must have some sort of error function that tells us how "wrong" we are; absolute truth doesn't exist, the data never has a full explanation, and even if it did our measurements are imperfect, so a gradation of truthfulness for theories is important. The error function is domain-specific and can be chosen based on a variety of criteria.

Horvath runs these error functions both on the data used to train the model and on the test data on which the models haven't been fit, and gets (pcc = 0.97, median error = 2.9 years) and (pcc = 0.96, median error = 3.6 years) respectively. Note that, as expected, the model performs better on the data it was fit to emulate and worse on the test data.
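A sketch of this leave-one-dataset-out procedure, assuming scikit-learn and scipy, with hypothetical arrays methylation (samples × CpGs), age, and dataset_id marking which study each sample came from. This is an illustration of the procedure, not Horvath's actual code (which, among other things, transforms age before fitting, as discussed below).

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_dataset_out(methylation, age, dataset_id):
    """Fit on all datasets but one, predict the left-out one, repeat for every dataset."""
    predictions = np.empty_like(age, dtype=float)
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(methylation, age, groups=dataset_id):
        model = ElasticNetCV(cv=5)  # an elastic net, its penalties chosen by an inner CV
        model.fit(methylation[train_idx], age[train_idx])
        predictions[test_idx] = model.predict(methylation[test_idx])
    median_error = np.median(np.abs(predictions - age))
    correlation, _ = pearsonr(predictions, age)
    return predictions, median_error, correlation
```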

If we want to ask the only relevant question, which is how well the model generalizes, the answer "it will have a median error of 3.6 years and a correlation coefficient of 0.96" is a gold standard, much better than saying "it will have a correlation coefficient of 0.97 with p=0.08", which would be the typical approach. No p-values are needed, since leaving out one sample is the closest equivalent we have to collecting new data in the real world and seeing how well the model generalizes on it. In the p-value paradigm one could say the above method is the equivalent of having a p-value of exactly 0, though the p-value paradigm is so flawed I'd rather we just stop thinking in its terms entirely.

You'll note that Horvath leaves out datasets, rather than single samples: datasets collected from different people, containing different tissue samples, and analyzed using different devices. The aim of this is to simulate "new" data coming in as much as possible. We should vary everything but the variables we think are predictive (the CpG site methylation) as much as possible in the left-out samples compared to the training samples, while balancing this against being able to observe as much data as possible.

The main criticism of predictive approaches in science is that they "ignore theory in favor of data and models". While I'd personally take such a criticism as a point of pride, it should be stressed that this paper doesn't do that, quite the contrary. It is theoretical speculation that led to the usage of epigenetic data to predict age, and the principal output of the paper is not the age-prediction model, but rather the discovery that epigenetic data can indeed predict age so well. The modeling itself is rather domain-specific; to quote the author:

My "log linear transformation of age" was an important component of the model for the following reasons.

a) It did improve the model fit.

b) it is biologically intuitive. Linear relation after maturity. Non linear relationship before that.

c) Several other papers have used it since it works well. Most notably we used a very similar transformation in our universal pan mammalian clock (Clock3) see our paper on bioRxiv (Ake T Lu).
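A transform of the kind he describes might look like the sketch below: logarithmic before maturity, linear after, and continuous at the boundary. The cutoff of 20 years and the exact functional form are assumptions made for illustration; the constants in the paper may differ.

```python
import numpy as np

ADULT_AGE = 20  # assumed age of maturity

def transform_age(age):
    age = np.asarray(age, dtype=float)
    return np.where(
        age <= ADULT_AGE,
        np.log(age + 1) - np.log(ADULT_AGE + 1),  # non-linear before maturity
        (age - ADULT_AGE) / (ADULT_AGE + 1),      # linear after maturity
    )

def inverse_transform_age(value):
    value = np.asarray(value, dtype=float)
    return np.where(
        value <= 0,
        (ADULT_AGE + 1) * np.exp(value) - 1,
        value * (ADULT_AGE + 1) + ADULT_AGE,
    )
```

The model would then be fit on transform_age(age), and its raw predictions mapped back through inverse_transform_age before computing errors in years.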

Of course, what Horvath really wants to give us here is not just a predictor of age but a predictor of health, a new concept of "epigenetic age" which predicts the vitality and time of death for an organism better than chronological age. Here are some tricks he could use for this:

1. We can demonstrate that "epigenetic age" tracks age pretty linearly except under some interventions (this is what the paper does) and assume the relation is causal, then infer that any discrepancy between the two due to some intervention is equivalent to de-aging. This is pretty weak, as it relies on unverified causal models rather than observation.

2. We can, instead of predicting age, predict "time to death" or "time to disease"; this is what [AD22] does, and indeed it obtains better results than Horvath's algorithm with similar inputs.

3. We can look at outliers, for example CpG methylation in cancerous tissue, in people with progeria, or in EBV-infected B cells, and see that their "epigenetic age" is significantly older, which is expected.

Horvath opts for approaches 1 and 3 in this paper.

The final question is what's wrong with this approach. Compared to a bread-and-butter statistical-test approach, nothing; it is strictly superior. But parts of it are subjective and could be changed.

It uses the regression coefficients for each CpG site to estimate their influence and importance. This isn't ideal. A better approach would have been to run this same process while always leaving out a single CpG site, thus estimating its influence by seeing how well we'd predict in its absence. [IM22] provides a more detailed criticism of the use of regression coefficients, as well as alternative methods for determining the importance and effect of features (in this case, a feature being the methylation data for a single CpG site).
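A sketch of that leave-one-feature-out estimate, reusing the hypothetical leave_one_dataset_out helper and arrays from the earlier sketch. Note that it costs one full refit per CpG site, which is part of why cheaper approximations in the same spirit, such as permutation importance [IM22], are often used instead.

```python
import numpy as np

def feature_importances(methylation, age, dataset_id):
    """Importance of a CpG site = how much the out-of-sample error degrades without it."""
    _, baseline_error, _ = leave_one_dataset_out(methylation, age, dataset_id)
    importances = []
    for site in range(methylation.shape[1]):
        reduced = np.delete(methylation, site, axis=1)  # drop a single CpG site
        _, error, _ = leave_one_dataset_out(reduced, age, dataset_id)
        importances.append(error - baseline_error)
    return np.array(importances)
```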

The paper never tries multiple models. I call the model used here a "regression", but an elastic net is in some ways better: while the final model is the same as a regression's, a hyperplane through our space, its training (fitting to the data) includes some heuristics that ought to encourage generalizability on unseen data. But other models allow for more parameters to be fitted to the data, and thus more complex explanatory functions (which hopefully yield better results). In this case the accuracy was so good it might not have been worth the bother, and both Horvath himself and other people familiar with the problem have told me that in practice using e.g. MLPs didn't really improve anything. Though there are papers that use more fine-grained models and observe better accuracies on different epigenetic datasets [EC20].

Finally, the error functions are somewhat arbitrary. This can't be helped; finding the "ultimate" error function is an epistemological if not metaphysical problem without a solution. The best one can do in engineering is find the error functions that best approximate distance from a desired outcome, and the best one can do in science is find error functions that are intuitive. In this case, median error and the correlation coefficient are both somewhat intuitive as well as common. They are both unideal for penalizing outliers, but that's arguably desirable; after all, part of the underlying theory is that the epigenome is a predictor of lifespan, not age, and can be influenced to increase lifespan, so we'd expect outliers to exist.

I should mention here that I'm not criticizing the author one bit; the paper is excellent for its time and still more rigorous, meticulous, and reader-friendly than 99.x% of the molecular and cellular biology papers I browse. But looking at things that could have been done differently is useful for gaining a deeper understanding.

6. Progress and AI Safety Implications

Since the replication crisis, it’s become more obvious that a lot of models are not trustworthy and should be discarded. The same can be said about data, but lying about raw data is harder than lying with models.

In light of this, tools that could help us rebuild world models with higher certainty from known data, models that could be widely agreed upon, with less built-in subjectivity, might be able to salvage a lot of knowledge.

Furthermore, this sort of standardization would help centralized bodies and individuals better understand scientific and engineering progress years before we see its impact upon reality (or lack thereof: “Where’s my flying car!?”).


Finally, I think it’s worth noting the AI safety implications here.

At bottom, most people agree that it’s very likely AIs will prove problematic if we allow them to make complex decisions where “good” and “bad” or “true” and “false” are hard to evaluate: from monetary policy, to medicine, to law.

If our models of the world become too complex, leaky, incorrect, and disorganized, orders of magnitude too hard for a human mind to make efficient use of, we are economically inviting the creation of very powerful AIs to reason with this data and tell us how to act upon the world, or to act upon it directly.

Conversely, if our models of the world are well abstracted, so well that your average politician or manager could reasonably understand the limits of some scientific finding or engineering project, then the need for alien general AIs diminishes, and even rogue actors developing them for nefarious reasons would lose a lot of their edge.

We need to guide our ever-increasing computing power towards scenarios where it can be applied with results that are easy for humans to evaluate and understand. The examples given in this article all fit that mold: powerful machine learning models with well-stated goals that we can use like tools to reduce our cognitive burden and improve our yield from interacting with the world. LLMs like GPT-3 or PaLM are the opposite of that: powerful models which we can’t easily validate, whose goal (predicting the next most likely token in a series) is completely unaligned with the problems we hope, and ask, them to solve.


Universal modelers seem like a unique chance for improving and accelerating science and engineering, while at the same time making them more legible. The fact that they reduce the economic need for “blind and dumb” intelligent algorithms is the cherry on top.

I strongly believe that in any positive-sum future we’ll see the trend towards using them accelerate, but that acceleration can’t come fast enough, so we should aim to unify our ideas about these abstractions, bring them to public consciousness, and use them as much as humanly possible.

References

Yes I will format these better at some point. The Schmidhuber in-line citation format is superior and I will die on that hill.

[MP81] Sir W. Thomson, Minutes of the Proceedings of the Institution of Civil Engineers, Volume 65, 1881, pp. 2-25

[PF66] L. J. Lander & T. R. Parkin, Counterexample to Euler's Conjecture on Sums of Like Powers, Bull. Amer. Math. Soc. 72, 1966

[DL16] Goodfellow et al, Deep Learning, MIT Press, Chapter 14, 2016

[PA19] Saïd Ladjal, Alasdair Newson, Chi-Hieu Pham, A PCA-like Autoencoder, 2019

[CV19] Michal Rolínek et al, Variational Autoencoders Pursue PCA Directions (by Accident), CVPR 2019 Open Access, 2019

[AS20] Kai Fukami et al, Assessment of supervised machine learning methods for fluid flows, 2020

[CV74] M. Stone, Cross-Validatory Choice and Assessment of Statistical Predictions, Journal of the Royal Statistical Society: Series B (Methodological), 1974

[IM22] Christoph Molnar, Interpretable Machine Learning, chapters 6, 7, 8, 9, 11.2, 2022

[TD20] George Hosu, Is a trillion-dollars worth of programming lying on the ground?, 2020

[RD21] Paul Garnier et al, A Review on Deep Reinforcement Learning For Fluid Mechanics, 2021

[AC19] Mohammed AlQuraishi, AlphaFold at CASP13, 2019

[HA21] John Jumper et al, Highly accurate protein structure prediction with AlphaFold, 2021

[ES22] Forough Majidi et al, An Empirical Study on the Usage of Automated Machine Learning Tools, 2022

[SH21] Mark Spitznagel, Safe Haven: Investing for Financial Storms, 2021

[EC20] Ricón, José Luis, “Epigenetic clocks: A review”, Nintil (2020-06-16), available at nintil.com/epigenetic-clocks

[SD13] Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol 14, 3156 (2013). https://doi.org/10.1186/gb-2013-14-10-r115

[AD22] Lu AT et al, DNA methylation GrimAge version 2, Aging (Albany NY), 2022 Dec 14;14(23):9484-9549. doi: 10.18632/aging.204434

Footnotes

[1] The mathematics relevant for the 21st century, the kind of mathematics that folds proteins, simulates fluids, predicts the result of complex chemical reactions, or combines the oratory skills of Cicero with the intelligence of von Neumann, the kind that turns our words into complex 3D renderings and then simulates life within them, the kind that will run our laboratories and stand judge over criminal cases… that’s a kind of mathematics where to “prove” and to “brute force” or to “experiment” are fairly similar.

Acknowledgments

Thanks to:

Michael Vasser for motivating me to compile some of my old ideas into this article, as well as for an interesting discussion on Greek mathematics which motivated and modified the first section.

Steve Horvath for giving me in-depth feedback on my explanation & critique of his original epigenetic clock paper, as well as helping me understand his underlying thought process. Though I will stress that the views expressed in the section about that paper are not his, and I believe some go against his position, so please direct any criticism at me.

Stephen Manilla for suggesting I add the practical example and giving some hints that ended up with me focusing on Horvath's paper, as well as helping me figure out who the target demographic for this might actually be and how I’d want to communicate to them.

Malmesbury and Marius Hosu for some very good editorial suggestions that have made this article more succinct and pleasant to read.
