Something I've been meaning to write up for a while now:
I've realized that since the derivative is infinitesimal, we can actually strengthen the covariance niceties a lot. If f and g are arbitrary differentiable functions, then I believe that:

$$\mathrm{ntcov}(f(a), g(b)) = f'(a)\, g'(b)\, \mathrm{ntcov}(a, b)$$
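Sketching why, using the definition of ntcov from the post plus the chain rule (assuming a and b are scalar properties of the weights w):

$$\nabla_w f(a(w)) = f'(a(w))\,\nabla_w a(w), \qquad \nabla_w g(b(w)) = g'(b(w))\,\nabla_w b(w),$$

so ntcov(f(a), g(b)) = f′(a) g′(b) ntcov(a, b).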
I really like this post. Can you expand on your intuitions about this part:
For instance, if a is the expectation of a binary variable with a probability p for being 1, then I bet there is probably going to be a Bernoulli distribution aspect to it, such that ntvar(a) is approximately proportional to p(1−p), but likely with a scale factor that depends on the network architecture or parameters, rather than being entirely equal to it.
Sure!
So let's start with a basic example: an agent that has two actions, "don't" and "do". Suppose it has an output neuron that contains the logit for which action to take, and for simplicity's sake (will address this at the end of the post) let's assume that this output neuron is controlled by a single weight w which represents its bias. So this means that the variable a described in the OP expands into a = σ(w), where σ is the sigmoid function.
We can then compute ntvar(a) = (da/dw)² = (σ(w)(1−σ(w)))² = (p(1−p))². And, hmm, this actually implies that ntvar(a) is proportional to p²(1−p)², rather than the p(1−p) that my intuition suggested, I think? The difference is basically that p²(1−p)² is flatter than p(1−p), especially in the tails, where the former goes to 0 quadratically while the latter goes to 0 linearly.
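Here's a quick numerical sanity check of that in JAX (a minimal sketch; the bias value 0.7 is arbitrary):

```python
# Sanity check: for a(w) = sigmoid(w), ntvar(a) = (da/dw)^2 = (p(1-p))^2.
import jax

def a(w):
    return jax.nn.sigmoid(w)   # probability p of taking the "do" action

w = 0.7                               # arbitrary value for the bias weight
p = a(w)
ntvar = jax.grad(a)(w) ** 2           # ntvar(a) = ntcov(a, a) = (da/dw)^2
print(ntvar, (p * (1.0 - p)) ** 2)    # the two printed numbers match
```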
One thing I would wonder is what happens during training, if we e.g. use policy gradients and give a reward of 1 for "do" and a reward of -1 for "don't". The update rule for policy gradients is basically dw/dt = ∇_w E[r], which according to Wolfram Alpha expands into 2e^w/(e^w+1)², and which we can further simplify to 2p(1−p). But we would have to square it to get da/dt, so I guess the same point applies here as to before. 🤷
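Spelling that out, with p = σ(w) and using ∇_w log σ(w) = 1−p and ∇_w log(1−σ(w)) = −p, the expected policy-gradient update over the two actions is

$$\mathbb{E}[r\,\nabla_w \log \pi] = p \cdot 1 \cdot (1-p) + (1-p)\cdot(-1)\cdot(-p) = 2p(1-p) = \nabla_w \mathbb{E}[r],$$

which matches the simplification above.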
Anyway, obviously this is massively simplified because we are assuming a trivial neural network. In a nontrivial one, I think the principle would be the same, due to the chain rule, which gives you a factor of p(1−p) onto whatever gradients exist before the final output neuron.
Actually, upon further thought, for something like policy gradients, in the limit where the probability p is close to 0, da/dt would probably be more like p²? Because you get a factor of p from the probability, and then an additional factor of p from the derivative of the sigmoid/softmax, which adds up to it being p².
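Putting those two factors together in the gradient-flow version, with da/dw = p(1−p) and dw/dt = 2p(1−p) from above:

$$\frac{da}{dt} = \frac{da}{dw}\cdot\frac{dw}{dt} = p(1-p)\cdot 2p(1-p) = 2p^2(1-p)^2 \approx 2p^2 \quad \text{for small } p$$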
Price's equation is a fundamental equation in genetics, which can be used to predict how traits will change due to evolution. It can be phrased in many ways, but for the current post I will use the following simplified continuous-time variant:
$$\frac{dx}{dt} = \mathrm{gcov}(x, f) = (\nabla_g \mathbb{E}[x \mid g])\,\mathrm{cov}(g, g) \cdot (\nabla_g \mathbb{E}[f \mid g])$$
Here, x represents some genetic trait, f represents the fitness of the organism, g represents the genes of an organism, and gcov represents the genetic covariance between the trait and the fitness. Usually people only use the dx/dt = gcov(x, f) part of the equation[1], but I've written out the definition
$$\mathrm{gcov}(a, b) = (\nabla_g \mathbb{E}[a \mid g])\,\mathrm{cov}(g, g) \cdot (\nabla_g \mathbb{E}[b \mid g])$$
because that will make the analogy to neural networks easier to see.
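One special case is worth spelling out (a restatement of the definition, assuming the trait and fitness are deterministic differentiable functions of the genes, x = x(g) and f = f(g), so that E[x|g] = x(g) and E[f|g] = f(g)):

$$\mathrm{gcov}(x, f) = (\nabla_g x(g))\,\mathrm{cov}(g, g) \cdot (\nabla_g f(g))$$

This is the shape that the neural-network version below will mirror, with the gene covariance matrix cov(g, g) effectively replaced by the identity.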
Neural network training and Price's equation
Suppose we train a neural network's weights w using the following equation, where L represents the loss for the network:
$$\frac{dw}{dt} = -\nabla_w L(w)$$
In that case, if we have some property x(w) of the network (e.g. x could represent how a classifier labels an image, or how an agent acts in a specific situation, or similar), then we can derive an equation for x's evolution over time:
$$\frac{dx}{dt} = (\nabla_w x(w)) \cdot \frac{dw}{dt} = -(\nabla_w x(w)) \cdot (\nabla_w L(w))$$
Similar to how we have a concept of genetic covariance to represent the covariance linked to genes, we should probably also introduce a covariance concept linked to neural network weights, to make it cleaner to talk about. I'll call that ntcov (short for neural tangent covariance), defined as:
$$\mathrm{ntcov}(a, b) = (\nabla_w a(w)) \cdot (\nabla_w b(w))$$
Furthermore, to make the analogy closer, we might replace L with U = −L, yielding the following equation for predicting how any property x evolves under gradient-descent training:
$$\frac{dx}{dt} = \mathrm{ntcov}(x, U)$$
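To make the bookkeeping concrete, here is a minimal sketch in JAX (a made-up two-weight "network", a toy quadratic loss, and an arbitrary property x, none of which come from the post) checking that dx/dt computed via the chain rule equals ntcov(x, U):

```python
# Minimal check that under gradient flow dw/dt = -grad_w L, any property x(w)
# changes at rate dx/dt = ntcov(x, U), where U = -L.
import jax
import jax.numpy as jnp

def L(w):                        # toy loss, made up for illustration
    return jnp.sum((w - 1.0) ** 2)

def x(w):                        # toy property of the network, e.g. one output on a fixed input
    return jnp.tanh(2.0 * w[0] + w[1])

def U(w):                        # the "utility" U = -L
    return -L(w)

def ntcov(a, b, w):              # ntcov(a, b) = (grad_w a) . (grad_w b)
    return jnp.dot(jax.grad(a)(w), jax.grad(b)(w))

w = jnp.array([0.3, -0.5])
dw_dt = -jax.grad(L)(w)                    # gradient-flow update direction
dx_dt = jnp.dot(jax.grad(x)(w), dw_dt)     # chain rule: dx/dt = (grad_w x) . (dw/dt)
print(dx_dt, ntcov(x, U, w))               # the two numbers agree
```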
This makes a bunch of idealized assumptions about the training process, e.g. that we have an exact measure of the full gradient. It might be worth relaxing the math to more realistic assumptions and checking how much still applies. But for now, let's just charge ahead with the unrealistic assumptions.
Covariance niceties
Covariances play nicely with linear causal effects. If F and G are linear transformations, then cov(Fx, Gy) = F cov(x, y) G⊤.
For instance, suppose you have a reinforcement learner that has learned to drink juice when close to it. Suppose further that now the main determinant for whether it gets reward is whether it approaches juice when it sees juice. We might formalize that effect as r=fa, where r is the reward given to the agent, f is the frequency at which it sees juice that it can approach, and a is its likelihood of approaching juice if it sees it.
We can then compute: da/dt = ntcov(a, r) = ntcov(a, fa) = f ⋅ ntcov(a, a).
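Spelling out the middle step: this uses the fact that ntcov is bilinear, plus the assumption that f (the frequency of juice encounters) does not itself depend on the weights w:

$$\mathrm{ntcov}(a, fa) = (\nabla_w a(w)) \cdot (\nabla_w (f\,a(w))) = f\,(\nabla_w a(w)) \cdot (\nabla_w a(w)) = f\,\mathrm{ntcov}(a, a)$$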
ntcov(a,a) is a special quantity which we could call the neural tangent variance, ntvar(a). It represents the degree to which a is sensitive to the neural network's parameters. In common situations, this depends partly on the structure of the network, but also, more directly, on the nature and current value of a.
For instance, if a is the expectation of a binary variable with a probability p for being 1, then I bet there is probably going to be a Bernoulli distribution aspect to it, such that ntvar(a) is approximately proportional to p(1−p), but likely with a scale factor that depends on the network architecture or parameters, rather than being entirely equal to it.
In particular, this means that if p is very low (in the juice example, if it is exceedingly rare for the agent to approach juice it sees), then ntvar(a) will also be very low, and this will make ntcov(a,r) low and therefore also make da/dt low.
[1] And usually people also put in other terms to account for various distortions.