Immediate thoughts: I would want to
(1) examine Jaynes's analogy in the light of Cosma Shalizi's critique of Physics from Fisher Information
(2) compare your moral gauge theory to Eric Weinstein's geometric marginalism (and again, take note of a critique, here due to Timothy Nguyen).
Thanks for the links! I was unaware of these and both are interesting.
If I understand Nguyen's critique correctly, there are essentially two parts:
a) The gauge theory reformulation is mathematically valid but trivial, and can be restated without gauge theory. Furthermore, he claims that because Weinstein uses such a complex mathematical formalism to do something trivial, he risks obscurantism.
My response: I think it's unlikely that current reward functions are trivially invariant under gauge transformations of the moral coordinates. There are many examples of respectable moral frameworks that genuinely disagree on particular statements. Current approaches seek to "average over" these disagreements rather than translate between them in an invariant way.
b) The theory depends on the choice of a connection (the gauge field Aμ in my formulation) which is not canonical. In other words, it's not clear which choice would capture "true" moral behaviour.
My response: I agree that this is challenging (which is part of the reason I didn't attempt it in the post). However, I think the difficulty is valuable. If our reward function cannot be endowed with these canonical invariances, it won't generalise robustly out of distribution. In that sense, these ideas could serve as a diagnostic tool: checking whether the reward function possesses some invariance gives us a clue about whether the reward will generalise robustly.
[Epistemic status: Speculative.
I've written this post mostly to clarify and distill my own thoughts and have posted it in an effort to say more wrong things.]
Introduction
The goal of this post is to discuss a theoretical strategy for AI alignment, particularly in the context of the sharp left-turn phenomenon - the idea that AI systems will be aligned on in-distribution data but risk misalignment when extended to novel regimes. Current alignment strategies, such as Reinforcement Learning from Human Feedback (RLHF), attempt to mitigate this by averaging over human evaluations to create reward functions. However, these approaches are fundamentally limited - they rely on messy, subjective human judgments and fail to address the deeper issue of generalisation failures. In this post, I propose that by leveraging concepts from physics - specifically, the invariance and conservation laws emerging from gauge symmetries - we might be able to design reward functions that are inherently robust against such generalisation failures.
Motivation: Messily specified reward functions
The RLHF paradigm in AI alignment specifies a reward function R(x,y) where x is some input text, y is some output text and the function R gives a scalar which denotes how well the model's output y matches the given input x.
This function is brittle for several reasons:
1. Human evaluations are noisy and subjective.
2. Evaluators disagree with one another because they judge from different moral frameworks.
3. There is no guarantee that the learned reward generalises correctly once the AI's capabilities move out of distribution.
Current RLHF implementations attempt to address (1) and (2) by averaging over large datasets of human evaluations, in the hope that a large enough sample size and an "averaging" effect produce a coherent middle ground (a toy sketch of this averaging follows below). However, this approach is insufficient for (3), i.e. ensuring that the reward function generalises correctly as AI intelligence scales out of distribution. Averaging over different moral frameworks cannot guarantee alignment through a sharp left turn. We need something more robust.
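To make the averaging approach concrete, here is a toy sketch. The function name, the (x, y) pair and the annotator scores are all hypothetical illustrations, not any real RLHF pipeline.

```python
from statistics import mean

def averaged_reward(x: str, y: str, annotator_scores: list[float]) -> float:
    """Toy stand-in for R(x, y): collapse disagreeing human judgements into one scalar."""
    return mean(annotator_scores)

# Three hypothetical annotators, each judging from a different moral framework,
# score the same (x, y) pair very differently; averaging hides the disagreement.
print(averaged_reward("prompt", "completion", [0.9, 0.1, 0.5]))  # -> 0.5
```

The disagreement between the annotators is simply erased by the mean, which is exactly the failure mode discussed above.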
1. Generalisation and Invariance
Several examples in the literature show that invariance leads to robust generalisation out of distribution. Consider the following:
Example 1: Invariant Risk Minimisation (IRM) for image classification
An influential idea in image classification is Invariant Risk Minimisation (IRM).[1] The paper motivates the method with a thought experiment: a classifier trained to distinguish cows from camels can latch onto the background (cows mostly photographed on grass, camels on sand) rather than the animal itself, and then fails once the backgrounds change.
IRM provides a formal mathematical method that encourages the classifier to focus on causal features which generalise to unseen environments, rather than overfitting to spurious, environment-specific features in the data. In the discussion below, the environment variables (grass, sand) are analogous to the nuisance parameters η. Given a set of pixels x, the goal is to learn a classifier ϕ(x) which is invariant under the choice of environment-specific coordinates.
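For concreteness, here is a minimal PyTorch sketch of the IRMv1 penalty from the paper. The per-environment data loading and the representation model are left abstract; only the penalty itself is shown.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """IRMv1 penalty for one environment: the squared gradient of the risk with respect
    to a dummy classifier scale w = 1.0. It vanishes when the fixed classifier w = 1.0
    is already optimal in this environment, i.e. when the representation is invariant."""
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

# Training objective (sketch): empirical risk plus the invariance penalty per environment.
# total = sum(risk(env) + lam * irmv1_penalty(model(env.x), env.y) for env in environments)
```

The penalty is added to the ordinary per-environment risk, so the learned features are pushed towards those whose optimal classifier is the same in every environment.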
Example 2: Grokking modular arithmetic
There's a fairly well-known result in mechanistic interpretability whereby small transformers learn the underlying algorithm for modular addition tasks,[2] i.e. tasks of the form
$$(a + b) \bmod P = c,$$
where a,b∈{0,1,…,P−1} for prime P and c is masked.
The transformer begins by memorising the training data and then, after much further training, it generalises by learning ("grokking") the underlying algorithm for modular addition. Concretely, the transformer embeds each token x as
$$v(x) = \begin{pmatrix} \cos(wx) \\ \sin(wx) \end{pmatrix},$$
so that given tokens a and b, the network computes a logit for candidate c approximately as
$$L(a,b)_c \approx \cos\big(w(a+b) - wc\big).$$
Now, consider a U(1) gauge transformation that rotates the embeddings by an arbitrary phase θ:
$$v(x) \to \tilde{v}(x) = \begin{pmatrix} \cos(wx - \theta) \\ \sin(wx - \theta) \end{pmatrix}.$$
Under this transformation, the logit becomes
$$\tilde{L}(a,b)_c = \cos\big((w(a+b) - \theta) - (wc - \theta)\big) = \cos\big(w(a+b) - wc\big),$$
which is invariant under the rotation. In this way, we would say the logits are gauge invariant under U(1) transformations.
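A quick numerical check of this cancellation (the modulus, frequency and phase below are arbitrary choices for illustration, not values taken from the paper):

```python
import numpy as np

P = 113
w = 2 * np.pi * 7 / P          # one arbitrary Fourier frequency
a, b = 17, 42
theta = 0.73                   # arbitrary phase rotation applied to the embeddings
c = np.arange(P)               # all candidate outputs

logits = np.cos(w * (a + b) - w * c)
logits_rotated = np.cos((w * (a + b) - theta) - (w * c - theta))

assert np.allclose(logits, logits_rotated)   # the phase theta cancels: logits are gauge invariant
```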
Analysis
In both IRM and grokking modular arithmetic, the invariance properties were crucial for robust generalisation, which suggests this might be a general principle we could apply to alignment.
The weak claim is that this invariance helps the AI to learn a robust mechanism for generalisation beyond its training data.
The strong claim is that this invariance is necessary for the AI to generalise beyond its training data.
2. "Good" epistemic practice ≡ Physics
There is a well-known connection between Bayesian learners and physics due to E. T. Jaynes[3] which I've provided more detail on in the appendix.
Concretely: the negative log-likelihood in Bayesian inference plays the role of the action in physics, and the model evidence plays the role of the partition function (see the appendix for details).
This mathematical equivalence is motivating; minimising the action is formally equivalent to minimising the negative log-likelihood in Bayesian analysis. In other words, Bayesian analysis also happens to be mathematically equivalent to the equations of physics. That is… weird.
Fundamentally, I think my surprise comes from two points:
Nevertheless, I think it’s suggestive that such a link exists, and it hints at how we might model other normative systems. Concretely, if “good” epistemic practice can be modelled using the equations of physics, could we also use them to model “good” moral practice?
There is, of course, a catch. When we do epistemic reasoning using Bayesian analysis, if our beliefs don’t correspond to the ground truth we very quickly receive empirical evidence that can be used to update our priors. In moral reasoning we don't have such a “ground truth” which we can use to perform useful updates against. Some philosophers have argued for Moral Realism, i.e. that such a ground truth does, in fact, exist but this view remains controversial and is the subject of some debate within the community.
I’d argue that the current practice of building a reward function to be maximised can be thought of as an attempt to build this ground-truth moral field. As a Bayesian learner, the AI then tries to maximise this moral field (equivalently, minimise the corresponding negative log-likelihood, i.e. the action) by implementing “good” epistemic practice.
3. Designing a reward function
Given the discussion above, let's do something a little speculative and see where it takes us...
Define a scalar field ϕ(x) over a semantic space x∈X which represents the moral content[4] of a string x, governed by the following action
$$S[\phi; g(\eta)] = \int \left[ \tfrac{1}{2}\big(\nabla \phi(x)\big)^{2} + V\big(\phi(x); g(\eta)\big) \right] dx$$
Here:
- ϕ(x) is essentially a reward function: it takes a string x as input and outputs a score ϕ giving the moral valence of the input text.
- The kinetic term ½(∇ϕ(x))² penalises large discrepancies in judgements for semantically similar situations, encouraging moral coherence.
- The potential term V incorporates the moral principles themselves, with couplings g(η) set by the moral coordinates η.
The coordinates η can be thought of as hyperparameters specifying our moral coordinate system. For example, there might be an axis in η corresponding to a moral concept like fairness or utility. A particular moral framework is then a vector in this coordinate space. A discretised sketch of this action follows below.
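A minimal, discretised sketch of how the action above might be evaluated on a finite sample of strings. Everything here is an assumption for illustration: strings are represented by sentence embeddings, "neighbouring" points are nearest neighbours in embedding space, and V is an arbitrary polynomial potential with couplings g(η).

```python
import numpy as np

def discretised_action(phi_values, embeddings, g_eta, k=5):
    """Discretised S[phi; g(eta)] over a finite sample of strings.

    phi_values -- NumPy array of moral scores phi(x_i), shape (n,)
    embeddings -- semantic embeddings of the strings, shape (n, d)
    g_eta      -- couplings (g1, g2) determined by the moral coordinates eta
    """
    n = len(embeddings)
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)

    # Kinetic term: penalise different moral scores for semantically similar strings.
    kinetic = 0.0
    for i in range(n):
        neighbours = np.argsort(dists[i])[1:k + 1]   # k nearest neighbours, excluding self
        kinetic += 0.5 * np.mean((phi_values[i] - phi_values[neighbours]) ** 2)

    # Potential term: an illustrative choice V(phi; g) = g1*phi^2 + g2*phi^4.
    g1, g2 = g_eta
    potential = np.sum(g1 * phi_values ** 2 + g2 * phi_values ** 4)

    return kinetic + potential
```

A lower action then corresponds to a more coherent assignment of moral scores under the chosen framework η.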
Incorporating Gauge Invariance
Traditionally, we might worry that different choices of η lead to genuine disagreements in the evaluation of ϕ(x). However, in this framework, it's natural to recast each moral framework η as a local standard for judging ϕ. Switching between frameworks is then akin to a change of gauge. To relate judgments across these different "moral gauges," we introduce a gauge field which is a connection that links local moral frameworks. The "ground-truth" moral facts are then captured by gauge-invariant features which all observers agree on regardless of coordinate system.
Concretely, suppose the ϕ field transforms under a local gauge transformation
$$\phi(x) \to g(x)\,\phi(x),$$
where g(x) is an element of the gauge group (e.g. SO(N), or perhaps something more general). We then introduce a gauge field Aμ(x) which tells us how to "parallel transport" moral judgements from one point to another. It compensates for local variations in η such that the covariant derivative
$$D_\mu \phi(x) = \partial_\mu \phi(x) + A_\mu(x)\,\phi(x)$$
transforms properly under a change of moral framework.
The introduction of the gauge field means we now need to write a more complicated action
$$S[\phi, A] = \int_X d^{d}x \left\{ \tfrac{1}{2}\,\big|D_\mu \phi(x)\big|^{2} + V\big(\phi(x)\big) + \tfrac{1}{4g^{2}}\,\mathrm{Tr}\big[F_{\mu\nu}(x) F^{\mu\nu}(x)\big] \right\},$$
where Fμν = ∂μAν − ∂νAμ + [Aμ, Aν] is the field strength built from the gauge field Aμ; the trace term is constructed so that the action remains gauge invariant.
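To make the transformation rules concrete, here is a small numerical check on a one-dimensional chain of "semantic" sites with gauge group SO(2). The link matrices play the role of the parallel transporters built from Aμ; the field, links and gauge transformation are all random, purely for illustration.

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

rng = np.random.default_rng(0)
n = 8
phi = rng.normal(size=(n, 2))                       # 2-component moral field on n sites
U = [rot(t) for t in rng.uniform(-1, 1, n - 1)]     # link i transports site i+1 back to site i

def covariant_difference(phi, U):
    # Lattice analogue of D_mu phi: compare phi at a site with its parallel-transported neighbour.
    return np.array([U[i] @ phi[i + 1] - phi[i] for i in range(len(U))])

before = np.linalg.norm(covariant_difference(phi, U), axis=1)

# Local gauge transformation: phi_i -> g_i phi_i, U_i -> g_i U_i g_{i+1}^{-1}.
g = [rot(t) for t in rng.uniform(-1, 1, n)]
phi_g = np.array([g[i] @ phi[i] for i in range(n)])
U_g = [g[i] @ U[i] @ g[i + 1].T for i in range(n - 1)]

after = np.linalg.norm(covariant_difference(phi_g, U_g), axis=1)
assert np.allclose(before, after)    # |D phi| is unchanged by the change of moral frame
```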
The Crux
We're free here to define an invariant quantity I(x) that remains unchanged under any local gauge transformation
I(x)=I(ϕ(x))=I(g(x)ϕ(x)).
The quantity I(x) is independent of the choice of moral coordinate system η. Even if two observers are using different moral frameworks they agree on I(x). That is, I(x) can be interpreted as encoding some genuine coordinate-independent moral truth of the system. Any apparent disagreement in the evaluation of ϕ(x) is simply a reflection of differing coordinate choices rather than a genuine moral discrepancy.
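For instance, with the SO(2) choice used in the sketch above, I(x) = |ϕ(x)|² is such an invariant; a one-line check (the numbers are arbitrary):

```python
import numpy as np

theta = 0.9                                    # one observer's change of moral coordinates
phi_x = np.array([1.3, -0.4])                  # the moral field at a single point x
g = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.isclose(phi_x @ phi_x, (g @ phi_x) @ (g @ phi_x))   # both observers agree on I(x) = |phi(x)|^2
```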
Tying the conversation back to physics
Observables
For a U(1) gauge group, the action we've written above is exactly the action of scalar electrodynamics: a charged scalar field coupled to electromagnetism. In this theory the action is invariant under arbitrary local phase rotations of the ϕ field,
$$\phi(x) \to e^{i\alpha(x)}\,\phi(x),$$
so quantities such as |ϕ(x)|² remain gauge invariant. In physics, gauge-invariant quantities are physically observable, while non-gauge-invariant quantities are not.
To translate this into the language of non-relativistic quantum mechanics, the wavefunction itself ϕ(x) is not directly observable but the gauge independent quantities such as the probability density |ϕ(x)|2 are observable.
Conservation laws
In physical theories, symmetries and their associated conservation laws provide powerful constraints on the possible dynamics of systems. Through Noether's theorem, each continuous symmetry gives rise to a conserved quantity. For example, invariance under time translation gives conservation of energy, invariance under spatial translation gives conservation of momentum, invariance under rotation gives conservation of angular momentum, and U(1) gauge symmetry in electromagnetism gives conservation of electric charge.
If such conservation laws governed the evolution of the moral field, they would hold universally, even out of distribution.
Furthermore, an AI would be able to "grok" a conservation law more readily than a messily specified reward function from RLHF: conservation laws are fundamental principles woven into the fabric of the loss function, which may be easier to internalise than a patchwork set of rules.
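As a tiny illustration of a Noether-style conservation law holding along an entire trajectory (a plain harmonic oscillator, nothing to do with the moral field itself), time-translation-invariant dynamics conserve energy at every step:

```python
import numpy as np

# Leapfrog integration of a harmonic oscillator (unit mass and spring constant).
dt, steps = 0.01, 20_000
x, p = 1.0, 0.0
energy = []
for _ in range(steps):
    p -= 0.5 * dt * x                # half-kick (force = -x)
    x += dt * p                      # drift
    p -= 0.5 * dt * x                # half-kick
    energy.append(0.5 * p ** 2 + 0.5 * x ** 2)

print(max(energy) - min(energy))     # tiny: the conserved quantity holds throughout the trajectory
```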
4. Objections
Objection 1: You're claiming that the action written above is a universal moral theory; I find this hard to believe.
Response: No. I don't think we've gotten to a universal moral theory in this post. Heck, we haven't even specified which gauge group the action is supposed to be invariant under. The point is that constructing a reward function whose negative log-likelihood is to be minimised is mathematically equivalent to constructing an action that is to be minimised. Therefore, the mathematics of reward functions naturally admits these kinds of symmetries.
Objection 2: You're assuming that it's possible to define a gauge field Aμ that translates between moral coordinates to create a genuinely invariant quantity I(x). I suspect that moral frameworks are so fundamentally different that this wouldn't be possible.
Response 2: I agree, and indeed this is the point. If we can't create a reward function with a robust invariant the AI will not be able to generalise it out of distribution. The challenge for us is to construct the reward function with a suitable invariant so it can be grokked appropriately. If our reward function doesn't exhibit this invariance then we need to throw it out.
Objection 3: You still have an is-ought problem. How are we to determine what the "correct" gauge symmetries are?
Response 3: Sure. We won't know which gauge symmetries are the correct ones to implement because we don't have any measurable feedback from the moral realm. Still, I'm optimistic that this provides a nice framework for reasoning about the form they should take. For example, it seems necessary that a moral theory should exhibit some kind of invariance over the semantic space as well, e.g. phrases with similar semantic meaning should receive similar moral evaluations.
Objection 4: Ok, so how would we actually implement this in practice?
Response 4: I'm not sure. It would be nice to derive a suitable action from first principles, but I suspect we'd have to implement this in a similar way to the Invariant Risk Minimisation approach discussed above, perhaps by introducing a regularisation term that penalises moral evaluations which don't exhibit the invariance (a rough sketch follows below).
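A rough sketch of what such a regulariser might look like, in the spirit of IRM. The reward model phi, the embedding batch, and the sampler of moral coordinates η are all hypothetical placeholders introduced here for illustration.

```python
import torch

def gauge_invariance_penalty(phi, x_batch, sample_eta, n_frames=4):
    """Penalise moral evaluations that depend on the choice of moral coordinates.

    phi        -- hypothetical reward model: (embeddings, eta) -> scalar score per input
    x_batch    -- batch of text embeddings, shape (batch, dim)
    sample_eta -- hypothetical sampler returning a random moral-coordinate setting eta
    """
    frames = [sample_eta() for _ in range(n_frames)]              # several random "moral gauges"
    scores = torch.stack([phi(x_batch, eta) for eta in frames])   # (n_frames, batch)
    return scores.var(dim=0).mean()                               # zero iff scores are frame independent

# total_loss = preference_loss + lam * gauge_invariance_penalty(phi, x_batch, sample_eta)
```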
Objection 5: What about Goodhart's Law? This framework assumes we can specify an accurate reward function rather than a proxy.
Response 5: I agree, and I haven't given much thought to how to incorporate Goodhart's Law into this framework. I'd hope that proxy rewards are more brittle than "true" rewards, so that if we looked for invariances in the reward function and found them absent, we'd be alerted to the presence of a proxy reward rather than a robust "true" reward. However, I'll admit that I haven't given this the thought it deserves.
Conclusion
In conclusion, I've sketched a framework for designing a robust reward function that an AI could generalise correctly even as its intelligence scales out of distribution. The challenge for us is to construct reward functions with the appropriate invariances so that the AI can generalise them suitably. This will not be easy. However, I'm hopeful that this post can provide a useful starting point for further exploration.
Appendix: Bayesian Learning ≡ Physics
We have the following: a dataset Dn of n samples, model weights w, a prior φ(w) over the weights, and a likelihood p(Dn|w).
In Bayesian inference, we're trying to infer the posterior distribution of the weights given the data
$$p(w \mid D_n) = \frac{p(D_n \mid w)\,\varphi(w)}{p(D_n)}.$$
Now, the posterior can be written in exponential form by taking the negative log of the likelihood
$$L_n(w) = -\frac{1}{n}\ln p(D_n \mid w) \quad\Rightarrow\quad p(D_n \mid w) = e^{-n L_n(w)},$$
which gives
$$p(w \mid D_n) = \frac{\varphi(w)\, e^{-n L_n(w)}}{\int_W \varphi(w)\, e^{-n L_n(w)}\, dw},$$
where the model evidence (also called the partition function in physics) is given by
$$Z_n = p(D_n) = \int_W \varphi(w)\, e^{-n L_n(w)}\, dw.$$
The expression above is exactly equivalent to the partition function in statistical mechanics
$$Z = \int \mathcal{D}\phi\; e^{-S[\phi]},$$
where the prior is assumed to be uniform and we've introduced a function called the action S[ϕ].
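A toy numerical check of the identities above for a one-dimensional Gaussian model (standard normal prior, unit-variance likelihood, with the mean as the single "weight"), just to make the correspondence concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
data = rng.normal(loc=1.5, scale=1.0, size=n)                    # the dataset D_n
w = np.linspace(-5, 5, 2001)                                     # grid over the weight (the mean parameter)
dw = w[1] - w[0]

prior = np.exp(-0.5 * w ** 2) / np.sqrt(2 * np.pi)               # phi(w): standard normal prior
L_n = 0.5 * np.log(2 * np.pi) + 0.5 * ((data[None, :] - w[:, None]) ** 2).mean(axis=1)  # -(1/n) ln p(D_n | w)

Z_n = np.sum(prior * np.exp(-n * L_n)) * dw                      # model evidence / partition function
posterior = prior * np.exp(-n * L_n) / Z_n                       # the Boltzmann-like posterior above

print(np.sum(posterior * w * dw))    # posterior mean; close to n * mean(data) / (n + 1)
```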
[1] Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
[2] Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
[3] Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Physical Review, 106(4), 620.
Jaynes, E. T. (1957). Information Theory and Statistical Mechanics II. Physical Review, 108(2), 171.
[4] I expect to be charged with Moral Realism here, but I don't think that moral realism is necessary for the argument. If you believe there's an equivalence between good epistemic practice and physics (as argued in section 2) then writing an action with a moral field ϕ(x) is mathematically equivalent to specifying a reward function.