Very interesting! I'm excited to read your post.
I take back the part about pi and update determining the causal structure, because many causal diagrams are consistent with the same poly diagram.
I think what is going on here is that both the geometric expectation and the geometric derivative are of the form $\exp \circ A \circ \log$, with $A = \mathbb{E}$ and $A = \nabla$, respectively. Let's define the star operator as $A^* := \exp \circ A \circ \log$. Then $A^* \circ B^* = \exp \circ A \circ \log \circ \exp \circ B \circ \log = \exp \circ A \circ B \circ \log = (A \circ B)^*$, by associativity of function composition. Further, if $A$ and $B$ commute, then so do $A^*$ and $B^*$: $A^* \circ B^* = (A \circ B)^* = (B \circ A)^* = B^* \circ A^*$.
So the commutativity of the geometric expectation and derivative falls directly out of their representation as $\mathbb{E}^*$ and $\nabla^*$, respectively, by commutativity of $\mathbb{E}$ and $\nabla$, as long as they are over different variables.
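As a sanity check, the conjugation identity $A^* \circ B^* = (A \circ B)^*$ can be verified numerically. A minimal sketch; `star`, the toy array, and the row/column expectations are my own stand-ins:

```python
import numpy as np

# The star operator: star(A) = exp ∘ A ∘ log.
def star(A):
    return lambda x: np.exp(A(np.log(x)))

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=(4, 5))   # positive entries so log is defined

E_rows = lambda m: m.mean(axis=0)        # expectation over rows
E_cols = lambda m: m.mean(axis=-1)       # expectation over columns

# star of a plain mean is the geometric mean
g = star(lambda m: m.mean())(x)
assert np.isclose(g, np.exp(np.log(x).mean()))

# star(A) ∘ star(B) = star(A ∘ B): both collapse to the overall geometric mean
lhs = star(E_cols)(star(E_rows)(x))
rhs = star(lambda m: E_cols(E_rows(m)))(x)
print(np.isclose(lhs, rhs))  # True
```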
We can also derive what happens when the expectation and gradient are over the same variable: $\nabla_\theta^* \circ \mathbb{E}^*_{x \sim p_\theta}$. First, notice that $\mathbb{E}^*[f] = \exp(\mathbb{E}[\log f])$, so $\log \mathbb{E}^*[f] = \mathbb{E}[\log f]$. Also, $\nabla^* f = \exp(\nabla \log f)$.
Now let's expand the composition of the gradient and expectation: $\nabla_\theta\, \mathbb{E}_{x \sim p_\theta}[\log f] = \mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log f + \log f\, \nabla_\theta \log p_\theta]$, using the log-derivative trick. So $\nabla_\theta^*\, \mathbb{E}^*_{x \sim p_\theta}[f] = \exp\big(\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log f + \log f\, \nabla_\theta \log p_\theta]\big)$.
Therefore, $\nabla_\theta^*\, \mathbb{E}^*_{x \sim p_\theta}[f] = \mathbb{E}^*_{x \sim p_\theta}[\nabla_\theta^* f] \cdot \exp\big(\mathbb{E}_{x \sim p_\theta}[\log f\, \nabla_\theta \log p_\theta]\big)$.
Writing it out, we have $\nabla_\theta^*\, \mathbb{E}^*_{x \sim p_\theta}[f(x, \theta)] = \exp\big(\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log f(x, \theta) + \log f(x, \theta)\, \nabla_\theta \log p_\theta(x)]\big)$.
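If I set up a tiny concrete example, the identity checks out numerically. A sketch with a Bernoulli $p_\theta$ and a positive $f$; all the names and the choice of $f$ here are my own:

```python
import numpy as np

# Toy example: x ~ Bernoulli(sigma(theta)), f(x, theta) = exp(a*x + b*theta).
a, b, theta = 0.7, -0.3, 0.4
sig = 1 / (1 + np.exp(-theta))        # p_theta(x = 1)
dsig = sig * (1 - sig)                # d sigma / d theta

# Left side: geometric derivative of the geometric expectation.
# G(theta) = exp(E[log f]) = exp(a*sigma(theta) + b*theta),
# so exp(d/dtheta log G) = exp(a*sigma'(theta) + b).
lhs = np.exp(a * dsig + b)

# Right side: exp(E[grad log f + log f * grad log p]).
p = np.array([1 - sig, sig])          # p_theta over x in {0, 1}
x = np.array([0.0, 1.0])
log_f = a * x + b * theta
grad_log_f = np.array([b, b])
grad_log_p = np.array([-dsig / (1 - sig), dsig / sig])
rhs = np.exp(np.sum(p * (grad_log_f + log_f * grad_log_p)))

print(np.isclose(lhs, rhs))  # True
```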
And if I pushed around symbols correctly, the geometric derivative can be pulled inside of a geometric expectation ($\nabla_\theta^*\, \mathbb{E}_x^*[f] = \mathbb{E}_x^*[\nabla_\theta^* f]$) similarly to how an additive derivative can be pulled inside an additive expectation ($\nabla_\theta\, \mathbb{E}_x[f] = \mathbb{E}_x[\nabla_\theta f]$). Also, just as additive expectation distributes over addition ($\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$), geometric expectation distributes over multiplication ($\mathbb{E}^*[XY] = \mathbb{E}^*[X]\, \mathbb{E}^*[Y]$).
If I try to use this framework to express two agents communicating, I get an image with a V1, A1, P1, V2, A2, and P2, with cross arrows from A1 to P2 and A2 to P1. This admits many ways to get a roundtrip message. We could have A1 -> P2 -> A2 -> P1 directly, or A1 -> P2 -> V2 -> A2 -> P1, or many cycles among P2, V2, and A2 before P1 receives a message. But in none of these could I hope to get a response in one time step the way I would if both agents simultaneously took an action, and then simultaneously read from their inputs and their current state to get their next states. So I have the feeling that pi : S -> Action and update : Observation x S -> S already bake in this active/passive distinction by virtue of their type signatures, and this framing maybe just takes away the computational teeth/specificity. And I can write the same infiltration and exfiltration formulas by substituting S_t for V_t, Obs_t for P_t, Action_t for A_t, and S_env_t for E_t.
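A minimal sketch of that one-time-step roundtrip under the `pi : S -> Action` and `update : Observation x S -> S` signatures. The function and state representations are hypothetical, my own stand-ins:

```python
# Both agents act from their current state simultaneously,
# then both update on the other's action simultaneously.
def step(pi1, update1, s1, pi2, update2, s2):
    a1 = pi1(s1)               # pi : S -> Action, computed from state only
    a2 = pi2(s2)
    s1_next = update1(a2, s1)  # update : Observation x S -> S
    s2_next = update2(a1, s2)
    return s1_next, s2_next

# Toy example: each agent's state accumulates the messages it has heard.
pi = lambda s: f"msg({len(s)})"
update = lambda obs, s: s + [obs]
s1, s2 = step(pi, update, [], pi, update, [])
print(s1, s2)  # ['msg(0)'] ['msg(0)']
```

In one call to `step`, each agent's action reaches the other; no intermediate P/V/A cycles are needed.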
Actually maybe this family is more relevant:
https://en.wikipedia.org/wiki/Generalized_mean, where the geometric mean is the limit as the exponent $p$ approaches zero.
The "harmonic integral" would be the reciprocal of the integral of the reciprocal of a function -- https://math.stackexchange.com/questions/2408012/harmonic-integral
Also, here is a nice family that parametrizes these different kinds of averages (https://m.youtube.com/watch?v=3r1t9Pf1Ffk).
If arithmetic and geometric means are so good, why not the harmonic mean? https://en.wikipedia.org/wiki/Pythagorean_means. What would a "harmonic rationality" look like?
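These means are all instances of one function. A quick sketch (`power_mean` is my own helper, assuming positive inputs):

```python
import numpy as np

def power_mean(x, p):
    """Generalized (power) mean; p -> 0 recovers the geometric mean."""
    x = np.asarray(x, dtype=float)
    if p == 0:
        return np.exp(np.mean(np.log(x)))  # the p -> 0 limit
    return np.mean(x ** p) ** (1.0 / p)

x = [1.0, 2.0, 4.0]
print(power_mean(x, 1))     # arithmetic mean, ~2.333
print(power_mean(x, 0))     # geometric mean, 2.0
print(power_mean(x, -1))    # harmonic mean, ~1.714
print(power_mean(x, 1e-6))  # very close to the geometric mean
```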
I really like the idea of finding steering vectors that maximize downstream differences, and I have a few follow-up questions.
Have you tried/considered modifying the c_fc (MLP encoder layer) bias instead of the c_proj (MLP decoder layer) bias? I don't know about this context, but (i) c_fc makes more intuitive sense to me as a location to change, (ii) I have seen more success playing with it in the past than with c_proj, and (iii) they are not equivalent, because of the non-linearity between them.
I like how you control for radius by projecting gradients onto the tangent space and projecting the steering vector onto the sphere, but have you tried using cosine distance as the loss function, so there is less incentive for $R$ to blow up? Let $D(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\big(Z_{\ell_{\text{target}},i,t}(z),\, Z_{\ell_{\text{target}},i,t}(0)\big)$ in $\max_z D(z)$.
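For concreteness, a numpy sketch of that objective; `acts0`, `W`, and the linear response `Z` are hypothetical stand-ins for the model's activations at $\ell_{\text{target}}$, not the actual setup:

```python
import numpy as np

def cos_dist(u, v):
    """1 - cosine similarity, row-wise."""
    u = u / np.linalg.norm(u, axis=-1, keepdims=True)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return 1.0 - np.sum(u * v, axis=-1)

# Hypothetical stand-ins: acts0 holds the unsteered activations at the
# target layer (one row per (i, t) pair) and W a frozen linear response
# to the steering vector z; the real Z would come from the model.
rng = np.random.default_rng(0)
acts0 = rng.normal(size=(16, 8))
W = rng.normal(size=(8, 8))
Z = lambda z: acts0 + z @ W

z = rng.normal(size=8)
D = cos_dist(Z(z), Z(np.zeros(8))).sum()  # D(z); the search is max_z D(z)
```

Since `cos_dist` ignores magnitude, growing the radius of the steered activations no longer increases the loss by itself.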
When you do the iterative search for the next steering vector, I do not expect constraining the search to the subspace orthogonal to previously found steering vectors to be very helpful, since orthogonal vectors might very well be mapped to the same downstream part of latent space. Since the memory demands of learning steering vectors are quite cheap, I would be interested in seeing an objective that learns a matrix of steering vectors simultaneously, maximizing the sum of pairwise distances. Suppose we are learning $K$ vectors simultaneously:
$$\max_{z_1, \dots, z_K} \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\big(Z_{\ell_{\text{target}},i,t}(z_k),\, Z_{\ell_{\text{target}},i,t}(z_{k'})\big)$$
But this form of the objective makes it more transparent that a natural solution is to make each steering vector turn the output into gibberish (unless the LM latent space treats all gibberish alike, which I admit is possible). So maybe we would want a tunable term that encourages staying close to the unsteered activations while staying far from the other steered activations:
$$\max_{z_1, \dots, z_K} \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\big(Z_{\ell_{\text{target}},i,t}(z_k),\, Z_{\ell_{\text{target}},i,t}(z_{k'})\big) - \lambda \sum_{k=1}^{K} D(z_k)$$
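A numpy sketch of this combined objective; `acts0`, `W`, and `Z` are hypothetical stand-ins for the target-layer activations, and `lam` is the tunable weight:

```python
import numpy as np
from itertools import combinations

def cos_dist(u, v):
    """1 - cosine similarity, row-wise."""
    u = u / np.linalg.norm(u, axis=-1, keepdims=True)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return 1.0 - np.sum(u * v, axis=-1)

# Hypothetical stand-ins for the model's activations at the target layer.
rng = np.random.default_rng(0)
acts0 = rng.normal(size=(16, 8))   # unsteered activations, one row per (i, t)
W = rng.normal(size=(8, 8))        # frozen linear response to z
Z = lambda z: acts0 + z @ W

def objective(zs, lam=0.1):
    # Spread the K steered runs apart from each other...
    spread = sum(cos_dist(Z(zk), Z(zk2)).sum()
                 for zk, zk2 in combinations(zs, 2))
    # ...while penalizing distance from the unsteered run.
    anchor = sum(cos_dist(Z(zk), Z(np.zeros(8))).sum() for zk in zs)
    return spread - lam * anchor

zs = rng.normal(size=(3, 8))       # K = 3 steering vectors, learned jointly
score = objective(zs)
```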
Lastly, I would be interested in seeing this done on the final output probability distribution over tokens instead of at $\ell_{\text{target}}$, using KL divergence for the distance, since in that domain we can extract very fine-grained information from the model's activations. Let $D_{\mathrm{KL}}(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\big(Z_{\ell_{\text{unembed}},i,t}(z)\,\|\, Z_{\ell_{\text{unembed}},i,t}(0)\big)$ in
$$\max_{z_1, \dots, z_K} \sum_{k=1}^{K} \sum_{k'=1}^{K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\big(Z_{\ell_{\text{unembed}},i,t}(z_k)\,\|\, Z_{\ell_{\text{unembed}},i,t}(z_{k'})\big) - \lambda \sum_{k=1}^{K} D_{\mathrm{KL}}(z_k)$$
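A sketch of the KL distance over the output distribution; `logits0` and `W` are hypothetical stand-ins, not the model's actual unembedding:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """Row-wise KL(p || q) between probability vectors."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Hypothetical stand-ins: logits0 holds the unsteered output logits
# (one row per (i, t), toy vocab of 50) and W a frozen linear response to z.
rng = np.random.default_rng(0)
logits0 = rng.normal(size=(16, 50))
W = rng.normal(size=(8, 50))
probs = lambda z: softmax(logits0 + z @ W)

z = rng.normal(size=8)
D_kl = kl(probs(z), probs(np.zeros(8))).sum()  # D_KL(z)
```

Note that KL is asymmetric, which is why the pairwise sum above runs over all ordered pairs; the $k = k'$ terms contribute zero.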