All of these operationalizations are about exact notions from the training setup.
Another important notion is revealed identity:
A recent post from Jan Kulveit is relevant for this topic: https://www.lesswrong.com/posts/wQKskToGofs4osdJ3/the-pando-problem-rethinking-ai-individuality
Sometimes people talk about two AIs being “the same” or “different” AIs. We think the intuitive binary of “same vs. different” conflates several concepts which are often better to disambiguate. In this post, we spell out some of these distinctions. We don’t think anything here is particularly novel; we wrote this post because we think it’s probably mildly helpful for people to think this through, and because Buck fairly often wants to reference it.
Some example distinctions:
We think that it’s often important to be able to distinguish different senses of “personal identity” when discussing various issues in AI safety, especially issues related to collusion. We frame each sense of personal identity as telling us whether two actions[2] belong to the “same AI” or “different AIs”. Here are the senses we distinguish in this post:
Model identity
Model identity is based on which model parameters were run to produce a given action. “Pretrain identity” conflates all models finetuned from the same pretraining run. Whereas “parameter identity” requires the weights to be identical.
Some reasons to care about model identity:
Shared activations & shared context
For “shared activations”, we consider two actions to belong to the same AI when they are downstream of the same activations. For a deterministic[4] unidirectional transformer, two actions share activations if they are generated by the same model and their contexts share a common prefix. However, future architectures might break the close correspondence between shared activations and shared context.
For example, consider this setup: An LLM sees [prompt 1], generates some [python code], then [prompt 2] is appended instructing the same LLM to critique the code for security flaws, and it generates [critique]. If the activations on the [prompt 1] tokens contain a signal saying that it’s time for coordinated defection, the model can see those activations both while generating the code and while generating the critique.
If we instead append a system prompt to the beginning of the context for the monitoring step, then the activations on [prompt 1] and [python code] are no longer shared between the code-generation actions and the monitoring actions.
Shared memory
In the context of a scaffold, we might care about which contexts have shared access to a memory bank (for instance, a searchable note-taking system). We might sometimes think of all contexts accessing the same memory as a “single AI”, with a common base of knowledge and a greater ability to have shared plans. Shared memory is especially important in the absence of online training, as it becomes the main or only state through which a specific “AI” can have evolving knowledge/plans/goals.
Shared reward
In an RL setting, we could consider a group of actions to belong to the “same AI” if they share rewards. For instance, in a single player RL task, all of the actions in a particular episode share rewards in the policy gradient calculation, but actions in different episodes don’t share rewards.[5]
Consider a multiplayer game played by a transformer, where it roleplays a “red”, “blue”, and “green” player, each with their own objectives. We could perform RL as follows: When the model selects a red player action, those logits are updated using policy gradient for the red player reward, and likewise for blue and green.[6] In this setup, the red player actions are all incentivized to pursue the same reward, and could therefore pursue joint strategies to maximize that reward. Whereas red actions are not directly incentivized to help green actions obtain green reward.
Note that in this example, the three players share the same model activations in context. If we don’t cut the gradients across colors, then the cognition at all layers before the final-layer attention is incentivized to help all three “players”. It’s especially significant in this example that shared reward and shared activations yield different notions of identity.
“Shared reward” can also split from “model identity” and “shared activations” in other ways:
Shared role in training
Sometimes, we can cluster actions by the nature of the incentives they’d be under during training. We’ll call this “shared role in training”. This might be most interesting in the case with {same model, different context, different specific rewards}.
For example, consider this setup:
We’d expect the behaviors on “code-writing” contexts to pattern together, and likewise for “critique” contexts. It could be useful to think of them as belonging to two different agents.
If there are two types of episodes in training (e.g. two systems for provisioning reward), but the contexts are indistinguishable to the model, then the model cannot behave differently. So that variant shouldn’t count.
One more example to illustrate:
I’m not sure how important “shared role in training” is as a notion of personal identity.
Acknowledgements
Thanks to Charlie Griffin and Caleb Parikh for feedback and suggestions.
(original paper, recent empirical proof of concept)
We could define an “action” as the smallest unit of model output, e.g. a single token
But also other forms of misalignment and data poisoning
If the scaffold (~wastefully) re-runs the forward pass rather than keeping the prefix’s KV-cache, then hardware and MOE non-determinism can make the activations on the prefix differ across runs.
(Even if the reward function is the same across episodes, the reward values in one episode aren't used for policy gradient in the other episodes, by definition.)
We could, for instance, implement this by having the value head output a vector of size 3 predicting discounted future reward for each player.
This is obviously higher variance than rewarding each for their contribution, and so will lead to less efficient learning, but you might have to use this strategy if you don’t have a good way of automatically determining how much credit each should get.