Many thanks to Peter Barnett, my alpha interlocutor for the first version of the proof presented, and draft reader.
Analogies are often drawn between natural selection and gradient descent (and other training procedures for parametric algorithms). It is important to understand to what extent these are useful and applicable analogies.
Here, under some modest (but ultimately approximating) simplifying assumptions, natural selection is found to be mathematically equivalent to an implementation of stochastic gradient descent.
The simplifying assumptions are presented first, followed by a proof of equivalence. Next, a first attempt is made to consider the effect of relaxing each assumption and the departure from the equivalence that results. Finally, an alternative set of assumptions which retrieves a similar equivalence is presented.
Summary of simplifying assumptions
It is essential to understand that the equivalence rests on some simplifying assumptions, none of which is wholly true in real natural selection.
1. Fixed 'fitness function' or objective function mapping genome to continuous 'fitness score'
2. Continuous fixed-dimensional genome
3. Radially symmetric mutation probability density
4. Limit case to infinitesimal mutation
5. A degenerate population of 1 or 2
6. No recombination or horizontal transfer
(NB assumptions 5 and 6 will be addressed in detail and amended later)
Proof
Setup and assumptions
Let us make some substantial modelling simplifications, while retaining the spirit of natural selection, to yield an 'annealing style' selection process.
We have a continuous genome $x \in \mathbb{R}^n$ with a fixed fitness/objective function $f: \mathbb{R}^n \to \mathbb{R}$.
Mutations are distributed according to a probability density $p_\Delta: \mathbb{R}^n \to \mathbb{R}^+$ which is radially symmetric
- so $p_\Delta(\delta) = \rho(\|\delta\|)$ for some $\rho: \mathbb{R}^+ \to \mathbb{R}^+$ (a density, but not a probability density)
- a mutation in one direction is just as likely as a mutation in another direction
We also consider the case of mutations very small relative to the genome, tending to infinitesimal.
Selection is stochastic, but monotonically determined by the fitness differential, according to a selection function $P_S: \mathbb{R} \to [0, 1]$
- so the probability of $x'$ being selected over $x$ is $P_S(f(x') - f(x))$
- $P_S$ is a monotonically nondecreasing function: a greater 'fitness score' differential cannot reduce chances of selection
- i.e. $a \ge b \implies P_S(a) \ge P_S(b)$
- a good model for this is something like a softmax, Boltzmann, or logistic function[1], i.e. a normalised ratio of exponentials, but this is not essential for the proof
The population alternates between a single parent and its two offspring, one of which is selected to become the next generation's parent.
- selection according to $P_S$ abstracts any mechanism which results in becoming the ancestor of future generations
  - not dying first
  - reproducing sufficiently successfully to 'win'
  - capturing/converting resources more efficiently
- in this first setup, all offspring are uniparental and no genes are mixed by any means other than mutation
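The setup above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the post's own code: the quadratic `fitness`, the logistic `selection_prob` with its `temperature`, and the Gaussian mutation scale `sigma` are all hypothetical choices made only so the example runs.

```python
import math
import random


def fitness(x):
    """Hypothetical fixed fitness function: a single smooth peak at the origin."""
    return -sum(xi * xi for xi in x)


def selection_prob(diff, temperature=0.01):
    """Logistic selection function: nondecreasing in the fitness differential."""
    z = diff / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)


def mutate(x, rng, sigma=0.05):
    """Isotropic Gaussian mutation, one example of a radially symmetric density."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


def next_step(x, rng):
    """Repeat clone-vs-mutant generations until a mutant is selected."""
    while True:
        mutant = mutate(x, rng)
        if rng.random() < selection_prob(fitness(mutant) - fitness(x)):
            return mutant


def run(start, steps=200, seed=0):
    """Alternate between one parent and two offspring for a number of steps."""
    rng = random.Random(seed)
    x = list(start)
    for _ in range(steps):
        x = next_step(x, rng)
    return x


start = [3.0, -4.0]
final = run(start)
print(fitness(start), fitness(final))
```

With these particular choices the lineage should climb towards the peak, though the path it takes is stochastic.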
Theorem
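Now consider a binary fission from $x_t$, yielding one perfect clone $x_t$ and one mutant clone $x_t + \delta$, where the mutation $\delta \sim p_\Delta$.
Define the next 'step', $x_{t+1}$, as whatever mutant offspring eventually happens to be successfully selected vs a perfect clone, according to the selection function $P_S$ on their fitness differential. (If the perfect clone gets selected, it may take many intermediate generations to constitute a 'step' here.) Denote by $Z$ the resulting normalisation constant over mutations.
So the distribution over $x_{t+1}$ given $x_t$ is
$$p(x_{t+1} \mid x_t) = \frac{1}{Z}\, p_\Delta(x_{t+1} - x_t)\, P_S\big(f(x_{t+1}) - f(x_t)\big)$$
Call the mutation $\delta = x_{t+1} - x_t$. Then, writing the radially symmetric mutation density as $p_\Delta(\delta) = \rho(\|\delta\|)$, we find
$$p(\delta \mid x_t) = \frac{1}{Z}\, \rho(\|\delta\|)\, P_S\big(f(x_t + \delta) - f(x_t)\big) \approx \frac{1}{Z}\, \rho(\|\delta\|)\, P_S\big(\delta \cdot \nabla f(x_t)\big)$$
by considering the directional derivative of $f$ along $\delta$ at $x_t$ and the limit as $\|\delta\| \to 0$[2]. (Prior to this infinitesimal limit, we have instead the empirical approximation to the directional derivative.)
Now characterising $\delta$ by length $r = \|\delta\|$ and angle-from-gradient $\theta$
$$p(\delta \mid x_t) = \frac{1}{Z}\, \rho(r)\, P_S\big(r\, \|\nabla f(x_t)\| \cos\theta\big)$$
At this point it is clear that our step procedure depends, stochastically, on how closely the direction of the mutations matches the fitness function's gradient.
By inspecting the expected value of the step direction, $\mathbb{E}[\delta \mid x_t]$, we can make a more precise claim
$$\mathbb{E}[\delta \mid x_t] = \frac{1}{Z} \int \delta\, \rho(r)\, P_S\big(r\, \|\nabla f(x_t)\| \cos\theta\big)\, d\delta$$
and finally, by noticing that the components of $\delta$ orthogonal to the gradient contribute the integral of an odd function[3],
$$\mathbb{E}[\delta \mid x_t] \propto \nabla f(x_t)$$
with a nonnegative constant of proportionality, because $P_S$ is nondecreasing.
Thus the update between steps, $x_{t+1} - x_t$, is a stochastic realisation of a variable whose orientation is, in expectation, exactly the same as that of the gradient of the fitness function.
By similar inspection of $\|\mathbb{E}[\delta \mid x_t]\|$ we can see that it is a monotonic function of $\|\nabla f(x_t)\|$, depending on the particulars of $\rho$ and $P_S$, which together provide a gradient-dependent 'learning rate'.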
So natural selection in this form really is nothing but an implementation of unbiased stochastic gradient descent!
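The claim about the expected step can be checked numerically. The sketch below is an illustration under assumptions that are not from the proof itself: fitness is taken to be locally linear (so the fitness differential is exactly the dot product of gradient and mutation), mutations are isotropic Gaussian, and selection is logistic with a hypothetical `temperature`.

```python
import math
import random


def selection_prob(diff, temperature=0.1):
    """Logistic selection function (an assumed, monotone choice)."""
    z = diff / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_selected_mutation(grad, sigma=0.1, trials=20000, seed=0):
    """Average the mutations that win selection against a perfect clone.

    Assumes a locally linear fitness, so the differential is grad . delta.
    """
    rng = random.Random(seed)
    total = [0.0] * len(grad)
    selected = 0
    for _ in range(trials):
        delta = [rng.gauss(0.0, sigma) for _ in grad]  # radially symmetric
        diff = sum(g * d for g, d in zip(grad, delta))
        if rng.random() < selection_prob(diff):
            selected += 1
            total = [t + d for t, d in zip(total, delta)]
    return [t / selected for t in total]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


grad = [3.0, -4.0, 0.0]
mean_step = mean_selected_mutation(grad)
alignment = cosine(mean_step, grad)
print(alignment)
```

Individual selected mutations point in many directions; it is only their average whose direction should line up closely with the gradient.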
Discussion of simplifying assumptions
To what extent are the simplifying assumptions realistic? What happens to the equivalence when we relax any of the assumptions?
Fixed 'fitness function'
In real natural selection, the interaction between a changing environment and a dynamic distribution of organisms collaborating and competing leads to a time- and location-dependent fitness function.
Variable fitness functions can lead to interesting situations like evolutionarily stable equilibria with mixtures of genotypes, or time- or space-cyclic fitness functions, or (locally) divergent fitness functions, among other phenomena.
Such a nonstationary fitness function is comparable to the use of techniques like self-play in RL, especially in conjunction with Population-Based Training, but is less comparable to vanilla SGD.
As such it may be appropriate to think of real natural selection as performing something locally equivalent to SGD but globally more like self-play PBT.
Continuous fixed-dimensional genome and radially-symmetric mutation probability density
Moving from a continuous to a discrete genome means that the notion of a gradient is no longer defined in the same way, but we can still talk about empirical approximate gradients and differences.
The mechanisms which introduce mutations in real natural selection are certainly symmetrical in certain ways, but probably not in any way which straightforwardly maps to radial symmetry in a fixed-dimensional vector space.
Without radial symmetry, much of the mathematics goes through similarly, but instead of an unbiased estimate of the gradient direction, it is biased by the mutation sampling. As such, we might think of real natural selection as performing a biased stochastic gradient descent.
A comparison may be made to regularisation techniques (depending on whether they are construed as part of the training procedure or part of the objective function), or to the many techniques exploiting bias-variance tradeoffs in sampling-based gradient-estimation for RL, though these tend to be deliberately chosen with variance-reduction in mind, while natural selection may not exhibit such preferences.
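To make the bias concrete, here is a hedged numerical sketch with an axis-aligned Gaussian mutation density that is deliberately not radially symmetric. The per-axis scales in `sigmas`, the logistic `selection_prob`, and the locally linear fitness are all assumptions for illustration. For Gaussian mutations and linear fitness, Stein's lemma suggests the expected step is proportional to the mutation covariance applied to the gradient, i.e. a preconditioned rather than a plain gradient direction.

```python
import math
import random


def selection_prob(diff, temperature=0.1):
    """Logistic selection function (an assumed, monotone choice)."""
    z = diff / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_selected_mutation(grad, sigmas, trials=20000, seed=0):
    """Average selected mutation when each axis has its own mutation scale."""
    rng = random.Random(seed)
    total = [0.0] * len(grad)
    selected = 0
    for _ in range(trials):
        delta = [rng.gauss(0.0, s) for s in sigmas]  # not radially symmetric
        diff = sum(g * d for g, d in zip(grad, delta))  # linear fitness
        if rng.random() < selection_prob(diff):
            selected += 1
            total = [t + d for t, d in zip(total, delta)]
    return [t / selected for t in total]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


grad = [1.0, 1.0]
sigmas = [0.3, 0.03]
mean_step = mean_selected_mutation(grad, sigmas)
preconditioned = [s * s * g for s, g in zip(sigmas, grad)]

print(cosine(mean_step, grad))            # noticeably below 1: biased
print(cosine(mean_step, preconditioned))  # close to 1
```

In this sketch the mean step leans heavily towards the axis with the larger mutation scale, even though the gradient weights both axes equally.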
Limit case to infinitesimal mutation
In reality, mutations are not infinitesimal, though in practice they are very small relative to the genome. If we do not take the limit, instead of an exact directional derivative, we find an empirical-approximate directional derivative, yielding empirical-approximate stochastic gradient descent.
This means that in real natural selection, the implied 'step size' or 'learning rate' is coupled with the particulars of the selection strength, the variance of the stochastic gradient, and the degree of empirical approximation applied. In contrast, stochastic gradient descent per se need not couple these factors together.
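The gap between the exact and the empirical-approximate directional derivative can be seen in a small worked example. The quadratic `fitness`, the evaluation point, and the unit `direction` below are hypothetical, chosen only because the exact answer is easy to verify by hand.

```python
def fitness(x):
    """Hypothetical smooth fitness function with a single peak."""
    return -(x[0] - 1.0) ** 2 - 2.0 * (x[1] + 0.5) ** 2


def gradient(x):
    """Analytic gradient of the fitness function above."""
    return [-2.0 * (x[0] - 1.0), -4.0 * (x[1] + 0.5)]


def empirical_directional_derivative(x, direction, h):
    """Finite-size fitness difference along a unit direction, per unit length."""
    moved = [xi + h * di for xi, di in zip(x, direction)]
    return (fitness(moved) - fitness(x)) / h


x = [0.0, 0.0]
direction = [0.6, 0.8]  # unit vector
exact = sum(g * d for g, d in zip(gradient(x), direction))

sizes = (0.1, 0.01, 0.001)
errors = [abs(empirical_directional_derivative(x, direction, h) - exact)
          for h in sizes]
for h, err in zip(sizes, errors):
    print(h, err)
```

For this quadratic the error shrinks in proportion to the mutation size `h`, which is the sense in which smaller mutations give a better approximation to the gradient.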
A degenerate population of 1 or 2
If we expand the population to arbitrary size, it is possible to retrieve the equivalence with additional assumptions.
Instead of a parent individual and cloned and mutated offspring individuals, we can consider parent and offspring populations. The same reasoning and proof are immediately applicable if we assume that mutations arise sufficiently rarely to be either fixed or lost before the next mutation arises. In this case $P_S$, the probability of selection, becomes the probability of fixation.
Of course this is not true for real natural selection.
If instead we allow for multiple contemporary mutant populations, an identical treatment cannot be applied.
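As one illustration of a fixation probability that is monotone in the fitness differential, the sketch below uses the standard closed form for a single mutant in a Moran process. The Moran process is not part of the argument above; it, the population size, and the mapping from fitness differential to relative fitness via `exp` are all assumptions made purely for illustration.

```python
import math


def moran_fixation_probability(relative_fitness, population_size):
    """Fixation probability of one mutant in a Moran process of fixed size."""
    r = relative_fitness
    n = population_size
    if abs(r - 1.0) < 1e-12:
        return 1.0 / n  # neutral mutation
    return (1.0 - 1.0 / r) / (1.0 - r ** (-n))


def fixation_from_differential(diff, population_size=100):
    """Assumed mapping: relative fitness is exp(fitness differential)."""
    return moran_fixation_probability(math.exp(diff), population_size)


diffs = [-0.1, -0.01, 0.0, 0.01, 0.1]
probs = [fixation_from_differential(d) for d in diffs]
for d, p in zip(diffs, probs):
    print(d, p)
```

Under these assumptions the fixation probability is nondecreasing in the fitness differential, which is the only property of the selection function the proof relies on.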
No recombination or horizontal transfer
One of the most fascinating and mathematically complicating aspects of real natural selection is multiple heredity of genome elements, whether via horizontal transfer or sexual recombination.
The preceding proof of equivalence for natural selection and stochastic gradient descent rests on a model which does not include any notion of multiple heredity.
Recovering the equivalence allowing arbitrary population size and recombination
Interestingly, the 'complicating' factor of multiple heredity provides a way to retrieve the equivalence in the presence of multiple contemporary mutations, as long as we continue to consider the limit of infinitesimal mutations.
For a single-heredity population, with multiple contemporary mutant subpopulations, we must either model 'only one winner', or model an ongoing mixture of subpopulations of varying sizes, either of which the selection function $P_S$ is unable to model without modification.
On the other hand, in a multiple-heredity population, assuming eventually-universal mixing, and crucially continuing to assume a fixed fitness function (independent of the population mixture), a particular mutation must either fix or go extinct[4].
Proof sketch
So let us consider (instead of $x_t$ and $x_{t+1}$) $x_{t_0}$ and $x_{t_1}$, representing the fixed part of the genotype at times $t_0$ and $t_1$ respectively, that is, the initial genome plus all so-far-fixed mutations.
In the time between $t_0$ and $t_1$ the population will experience some integer number of mutation events (perhaps roughly Poisson-distributed but this is inessential for the proof), each of which is distributed according to the mutation density $p_\Delta$. Furthermore, at time $t_0$ some mutations from earlier times may be 'in flight' and not yet fixed or extinct.
Assuming fixed fitness, and infinitesimal mutations, we can represent the probability of fixation by time $t_1$, namely $P_f$, with exactly the same properties as formerly assumed for the selection function $P_S$[5]. Thus each mutation fixed between $t_0$ and $t_1$ satisfies exactly the same unbiased-gradient-sampling property derived earlier, and so, therefore, does their sum $x_{t_1} - x_{t_0}$.
This relies on all in-flight mutations not affecting the fitness differential, and thus the fixation probability $P_f$, of their contemporaries, which is certainly the case in the limit of infinitesimal mutations, but not the case for real natural selection.
Summary of additional assumptions
7. Eventually-universal mixing
In particular, this means no speciation.
NB we also rely on 4. the limit to infinitesimal mutations, in an additional capacity. We also exclude all 'self-play-like' interactions arising from the larger population by relying further on 1. fixed 'fitness function'.
It may be feasible to retrieve a similar equivalence without excluding population-dependent fitness interactions with a different framing, for example considering gradients over 'mixed strategies' implied by population distributions.
Conclusion
Natural selection, under certain conditions, carries out an implementation of stochastic gradient descent. As such, analogies drawn from one to the other are not baseless; we should, however, examine the necessary assumptions and be mindful of the impact of departures from those assumptions.
In particular, two sets of assumptions are presented here which together are sufficient to retrieve an equivalence:
1. Fixed 'fitness function' or objective function mapping genome to continuous 'fitness score'
2. Continuous fixed-dimensional genome
3. Radially symmetric mutation probability density
4. Limit case to infinitesimal mutation
5. A degenerate population of 1 or 2
6. No recombination or horizontal transfer
or, keeping assumptions 1 to 4 and relaxing assumptions 5 and 6
7. Eventually-universal mixing
This is not enough to cover all instances of real natural selection, but provides an approximate mapping from many instances.
Assumptions 2 and 3 together yield 'unbiased' SGD, and in their absence varying degrees of bias arise.
Assumption 1 rules out, most importantly, 'self play' and 'population-based' aspects of natural selection, which have other analogies in machine learning but which are firmly absent from vanilla SGD.
Further work could uncover other parameters of the emergent SGD, such as the variance of the implied gradient, the size of the implicit learning rate, the bias caused by relaxing assumption 3, or quantify the coupling between those factors.
Further scrutiny, especially of the assumptions related to population, 1, 5, 6, and 7, could better quantify the effect of making weaker or different assumptions.
This can be justified in a few ways
- If fitness is something like an Elo rating then a Boltzmann distribution is implied
- If we want to extend the two-individual case to the n-individual case but remain invariant to the arbitrary choice of 'baseline' fitness score, then a normalised ratio of exponentials is implied
- We may further appeal to the maximum entropy property of Boltzmann distributions as a natural choice
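The baseline-invariance point can be illustrated with a few lines of code. This is a minimal sketch; the function names and the `temperature` parameter are hypothetical. It shows that a normalised ratio of exponentials is unchanged when every fitness score is shifted by the same constant, and that the two-individual case reduces to a logistic function of the fitness differential.

```python
import math


def softmax(scores, temperature=1.0):
    """Normalised ratio of exponentials (Boltzmann distribution over scores)."""
    top = max(scores)  # subtracting the max avoids overflow
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic(diff, temperature=1.0):
    """Two-individual selection probability as a function of the differential."""
    return 1.0 / (1.0 + math.exp(-diff / temperature))


scores = [1.0, 2.0, 3.0]
shifted = [s + 100.0 for s in scores]  # arbitrary change of baseline

print(softmax(scores))
print(softmax(shifted))
print(softmax([0.3, 1.1])[1], logistic(1.1 - 0.3))
```

The first two printed distributions should agree up to floating-point error, and so should the last pair of numbers.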
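The directional derivative in question is, for unit vector $\hat\delta = \delta / \|\delta\|$,
$$\nabla_{\hat\delta} f(x_t) = \lim_{h \to 0} \frac{f(x_t + h\hat\delta) - f(x_t)}{h} = \hat\delta \cdot \nabla f(x_t)$$
so that $f(x_t + \delta) - f(x_t) \approx \delta \cdot \nabla f(x_t)$ for small $\|\delta\|$. ↩︎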
Cautious readers may note that the integral as presented is not posed in the right coordinate system for its integrand.
By a coordinate transformation from Euclidean to hyperspherical coordinates, centred on $x_t$, with $\nabla f(x_t)$ providing the principal axis, $r$ the radial length, $\theta$ the principal angular coordinate, and $\phi$ the other angular coordinates with axes chosen arbitrarily orthogonally, the integral can be posed over $(r, \theta, \phi)$,
where we use the fact that the hyperspherical Jacobian is independent of its principal angular coordinate and denote by $J$ the result of integrating out the Jacobian over the other angular coordinates, and again noting that the symmetrical integral over an odd function is zero. ↩︎
If we do not have a fixed fitness function, and in particular, if it is allowed to vary dependent on the distribution of the population, there are many evolutionarily stable equilibria which can arise where some trait is stably never fixed nor extinguished, but rather persists indefinitely in some proportion of the population. (A classic example is sex ratios.) ↩︎
We can be more precise if we have a fixation probability $P_f: \mathbb{R}^+ \times \mathbb{R} \to [0, 1]$ where the additional first parameter represents time-elapsed, so that $P_f(\tau, d)$ is the probability of a mutation with fitness delta $d$ being fixed after elapsed time $\tau$.
Here we impose on $P_f$ (for fixed time-elapsed) the same monotonicity requirement over fitness differential as imposed on the selection function $P_S$ before.
The various 'in-flight' and intervening mutations in the proof also therefore implicitly carry with them $t_\delta$, the time they emerged, and the additional argument to $P_f$ is thus $t_1 - t_\delta$.
In practice we should expect $P_f$ to vary time-wise as a monotonically nondecreasing asymptote, but this property is not required for the proof. ↩︎
As an initial aside, I wonder if there is a general factor of spherical-cow-trustingness (SCT?) which separates us. I surely have a moderate amount of SCT! This is not a stance which is easily succinctly justified, but for me comes from having seen (and conjured) many successful spherical cows over the years. Have you read MacKay's Information Theory, Inference, and Learning Algorithms? In the chapter 'Why have sex?', he has an even more spherical model of natural selection than mine here, which I (and many others) consider very illuminating. But, it's so spherical. So spherical. Still works :).
On that note, my models here of evolution seem pretty solid and obviously capturing the spirit, to me. The first one (which I call 'annealing-style') is very close to a classic introduction of simulated annealing, a widely-used/considered simple mutate-and-select algorithm. The second one is less rigorously-presented but captures a lot more of the spirit and detail of natural evolutions, including horizontal transfer and a larger population.
This is a much appreciated critique. You've clearly engaged with this post in some depth and put effort in to describe some disagreements. I'm not currently moved (beyond the existing stance of the original post, which has a bunch of caveats and future work).
On to some specific responses.
No, I think it just is this, the sampling being the stochastic realisation of competition between differently-fit individuals/populations/alleles. If you wanted, you could introduce something like a 'data distribution' (e.g. of environmental scenarios) and an expectation over success-rates on that distribution as the true-loss analogue. But you'd just end up with something of the form of my $P_S$ i.e. there'd be some latent 'true fitness' or 'true Elo-across-envs' (the actual expectation) and sample-based estimation would yield some (locally monotonic) probability of an 'actually fitter' instance being measured as such in any given lineup.[1]
Mild push-back (I'm not sure exactly what you're trying to say here): I don't prescribe a relationship between the gradient and the next step. All I prescribe is that fitter instances are directionally selected/preferred. The gradient comes out of the maths; that's basically the heart of this post!
Your example of correlated directions is really nice. I agree that SGD in full generality can have this. I also think a less spherical model of mutation+selection would also have this! (In fact lots of empirical and theoretical genetics finds/predicts correlations.)
Go back to my 'distribution of environment scenarios' lower-level model. Certainly this could induce correlations if we wanted it to, for basically the same reasons as in your example.
On descendant-generation and population, I refer at first to my spherical-cow-trustingness. To me it obviously captures the spirit! But in any case, I think the details are closer than you seem to allow.
This is only true of one semantic, for the first model. The first model also has another, population-wise semantic. 'Instead of a parent individual and cloned and mutated offspring individuals, considering parent and offspring populations, the same reasoning and proof is immediately applicable if we assume that mutations arise sufficiently rarely to be either fixed or lost before the next mutation arises... Of course this is not true for real natural selection.'
In the second model there is a population of different individuals with different genomes, consisting of a 'fixed' part and a set of 'in flight' (not yet fixed or extinct) mutations. This is a less-spherical cow, but I think captures even more the detail and spirit of real natural selection.
On learning rates, I appreciate this remark. I called this out in the original as something for further research, but (for SCT reasons) I don't consider it especially important/urgent.
As I said elsewhere, any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs ... is relevant to the conclusions we want to draw? (Serious question; my best guess is 'no', but I hold that medium-lightly.)
Now think about the space including SGD, RMSProp, etc. The 'depends on gradient norm' piece which arises from my evolution model seems entirely at home in that family. But I would also be interested in further analysis exploring the details of (various models of) natural selection from a 'learning rate' POV.
Of course, in practical SGD, there are also often learning-rate schedules, including decay but also more wacky things. Would be interesting (but not especially loadbearing) to learn more about how natural selection handles this (I've heard it suggested that mutation-rate, among other things, is meta-selected for adaptability, which seems eminently plausible).
True, usually. But in Darwin's day, 'fitness' just meant 'how fitted are you to the environment?' i.e. 'what's your Elo rating here?', which has later been operationalised in various ways, usually accounting for number or proportion of descendants (either at a genetic unit level or at an inclusive genetic level for organisms).
f is the quantity whose gradient we're ascending. And it's biology-fitness in the pre-operationalisation-as-fecundity/offspring sense.
Please let me know if I've missed anything that seems important to you - these are longish comments.
e.g. imagine characterising the sample-based loss estimate via mean and variance. For now, assume normal and homoscedastic (more spherical cows!). Is my claim about a monotonic selection prob arising from this then basically clear? There's some prob of 'accidentally' getting samples which invert the fitness order, but this prob decreases as the true fitness difference increases. (For non-normal or heteroscedastic, this still holds locally, but the coefficients will be different in different regions.) ↩︎
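For the normal, homoscedastic case this can be written down directly. The sketch below assumes each of two instances has its fitness estimated with independent Gaussian noise of the same standard deviation (`noise_std`, a hypothetical parameter), so the measured difference is Gaussian with variance twice the per-estimate variance.

```python
import math


def prob_measured_fitter(true_diff, noise_std=1.0):
    """P(the truly fitter instance also measures as fitter in one lineup).

    Assumes independent N(0, noise_std^2) noise on each fitness estimate, so
    the measured difference is N(true_diff, 2 * noise_std^2).
    """
    return 0.5 * (1.0 + math.erf(true_diff / (2.0 * noise_std)))


diffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [prob_measured_fitter(d) for d in diffs]
for d, p in zip(diffs, probs):
    print(d, round(p, 4))
```

The probability is one half when the two instances are equally fit and rises monotonically with the true fitness difference, which is the shape of selection function the main argument needs.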