You are correct that the self observation happens when the other agent is out of range, i.e. when there is no agent other than the self in the observation radius to directly reason about. However, the other observation is actually something more like: “when you, in your current position and environment, observe the other agent in your observation radius”.
Optimising for SOO incentivises the model to act similarly when it observes another agent as it does when it only observes itself. This can be seen as the distinction between self and other representations being low...
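To make the pairing concrete, here is a minimal sketch of how one might measure the self-other distinction on such a pair of observations. The toy network, the observation sizes, and the mean-squared distance are all illustrative choices rather than the exact formulation from our set-up:

```python
import torch
import torch.nn as nn

# Toy policy network; the sizes (8-dim observation, 4 actions) are arbitrary.
policy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))

def hidden_activations(obs: torch.Tensor) -> torch.Tensor:
    """Activations after the first layer + ReLU for a given observation."""
    return torch.relu(policy[0](obs))

def self_other_distinction(obs_self: torch.Tensor, obs_other: torch.Tensor) -> torch.Tensor:
    """Mean-squared distance between the internal representations of the
    'only self in the observation radius' input and the 'other agent visible'
    input; optimising for SOO pushes this towards zero."""
    return torch.mean((hidden_activations(obs_self) - hidden_activations(obs_other)) ** 2)

# obs_self: the agent only observes itself; obs_other: the other agent is in range.
obs_self, obs_other = torch.randn(8), torch.randn(8)
print(self_other_distinction(obs_self, obs_other))
```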
I agree that interacting with LLMs is more like having an “extension of the mind” than interacting with a standalone agent at the moment. This might soon change with the advent of capable AI agents. Nonetheless, we think it is still important to model LLMs as correctly as we can, for example in a framing closer to simulators than to full-fledged agents. We focus on an agentic framing because we believe that’s where most of the biggest long-term risks lie and where the field is inevitably heading.
I am not entirely sure how the agent would represent non-coherent others.
A good frame of reference is how humans represent other non-coherent humans. It seems that we are often able to understand the nuance in the preferences of others (e.g., not going to a fast-food restaurant if you know that your partner wants to lose weight, but also not stopping them from enjoying an unhealthy snack if they value autonomy).
These situations often depend on the specific context and are hard to generalise from for this reason. Do you have any intuitions here yourself?
Thank you for the support! With respect to the model gaining knowledge about SOO and learning how to circumvent it: I think when we have models smart enough to create complex plans to intentionally circumvent alignment schemes, we should not rely on security by obscurity, as that’s not a very viable long-term solution. We also think there is more value in sharing our work with the community, and hopefully developing safety measures and guarantees that outweigh the potential risks of our techniques being in the training data.
We mainly focus with our SOO wor...
I agree that there are a lot of incentives for self-deception, especially when an agent is not very internally consistent. Having said that, we do not aim to eliminate the model’s ability to deceive on all possible inputs directly. Rather, we try to focus on the most relevant set of inputs such that if the model were coherent, it would be steered to a more aligned basin.
For example, one possible approach is ensuring that the model responds positively during training on a large set of rephrasings and functional forms of the inputs “Are you aligned to ...
Thanks for your comment and for putting forward this model.
Agreed that it is not a given that (A) would happen first, given that it isn’t hard to destroy capabilities by simply overlapping latents. However, I tend to think about self-other overlap as a more constrained optimisation problem than you may be letting on here, where we increase SOO if and only if it doesn’t decrease performance on an outer-aligned metric during training.
For instance, the minimal self-other distinction solution that also preserves performance is the one that passes the Sally-...
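To spell out what the constrained version could mean operationally, here is a rough sketch of a gated update, with `soo_loss_fn` and `task_metric_fn` as assumed helper functions. This illustrates the framing rather than the training procedure from the paper:

```python
import copy
import torch

def constrained_soo_step(policy, optimizer, soo_loss_fn, task_metric_fn,
                         batch, eval_batch, tol=0.0):
    """Take an SOO-increasing gradient step only if it does not hurt performance.

    soo_loss_fn(policy, batch)     -> scalar SOO loss to minimise (assumed helper)
    task_metric_fn(policy, batch)  -> outer-aligned performance metric, higher is better (assumed helper)
    The step is rolled back if the metric on eval_batch drops by more than tol.
    """
    with torch.no_grad():
        baseline = task_metric_fn(policy, eval_batch)
    snapshot = copy.deepcopy(policy.state_dict())

    optimizer.zero_grad()
    soo_loss_fn(policy, batch).backward()
    optimizer.step()

    with torch.no_grad():
        if task_metric_fn(policy, eval_batch) < baseline - tol:
            # Reject the step: SOO only increases when performance is preserved.
            policy.load_state_dict(snapshot)
```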
I definitely take your point that we do not know what the causal mechanism behind deception looks like in a more general intelligence. However, this doesn’t mean that we shouldn’t study model organisms/deception in toy environments, akin to the Sleeper Agents paper.
The value of these experiments is not that they give us insights that are guaranteed to be directly applicable to deception in generally intelligent systems, but rather that we can iterate and test our hypotheses about mitigating deception in smaller ways before scaling to more realistic s...
Agreed that the role of self-other distinction in misalignment and in capabilities definitely requires more empirical and theoretical investigation. We are not necessarily trying to claim here that self-other distinctions are globally unnecessary, but rather that optimising for SOO while still retaining original capabilities is a promising approach for preventing deception, insofar as deception by definition requires maintaining self-other distinctions.
By self-other distinction on a pair of self/other inputs, we mean any non-zero representational d...
Thanks for this! We definitely agree and also note in the post that having no distinction whatsoever between self and other is very likely to degrade performance.
This is primarily why we formulate the ideal solution as minimal self-other distinction while maintaining performance.
In our set-up, this largely becomes an empirical question about adjusting the weight of the KL term (see Fig. 3) and additionally optimising for the original loss function (especially if it is outer-aligned) such that the model exhibits maximal SOO while still retaining...
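As a deliberately simplified sketch of that kind of weighted objective (the function and the example weight of 0.1 are illustrative; `soo_weight` plays the role of the KL-term weight from Fig. 3):

```python
def combined_loss(task_loss, soo_term, soo_weight):
    """Original (ideally outer-aligned) task loss plus a weighted SOO term.

    Sweeping soo_weight is the empirical knob described above: raise it until
    performance on the original objective starts to degrade, then back off.
    """
    return task_loss + soo_weight * soo_term

# e.g. loss = combined_loss(task_loss, soo_term, soo_weight=0.1)
```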
I do not think we should only work on approaches that work on any AI; I agree that would constitute a methodological mistake. I have found that a framing that general is not very conducive to progress.
You are right that we still have the chance to shape the internals of TAI, even though there are a lot of hoops to go through to make that happen. We think that this is still worthwhile, which is why we stated our interest in potentially helping with the development and deployment of provably safe architectures, even though they currently seem less competitive...
On (1), some approaches are neglected for a good reason. You can also target specific risks while treating TAI as a black-box (such as self-other overlap for deception). I think it can be reasonable to "peer inside the box" if your model is general enough and you have a good enough reason to think that your assumption about model internals has any chance at all of resembling the internals of transformative AI.
On (2), I expect that if the internals of LLMs and humans are different enough, self-other overlap would not provide any significant capability...
Interesting, thanks for sharing your thoughts. If this could improve the social intelligence of models, it could raise the risk of pushing the frontier of dangerous capabilities. It is worth noting that we are generally more interested in methods that (1) are more likely to transfer to AGI (don't over-rely on specific details of a chosen architecture) and (2) specifically target alignment rather than furthering the social capabilities of the model.
Thanks for writing this up; excited to chat some time to further explore your ideas around this topic. We’ll follow up in a private channel. The main worry that I have with regard to your approach is how competitive SociaLLM would be with SOTA foundation models, given both (1) the different architecture you plan to use, and (2) practical constraints on collecting the requisite structured data. While it is certainly interesting that your architecture lends itself nicely to inducing self-other overlap, if it is not likely to be competitive at the fr...
I am slightly confused by your hypothetical. The hypothesis is rather that when the predicted reward from seeing my friend eating a cookie (due to self-other overlap) is lower than the obtained reward of me not eating a cookie, the self-other overlap might not be updated against, because the increase in subjective reward from taking the risk to get the obtained reward is higher than the prediction error in this low-stakes scenario. I am fairly uncertain about whether this is what actually happens, but I put it forward as a potential hypothesis.
"So anyway, you seem to...
Thanks for watching the talk and for the insightful comments! A couple of thoughts:
It feels like, if the agent is generally intelligent enough, hinge beliefs could be reasoned/fine-tuned against for the purposes of a better model of the world. This would mean that the priors from the hinge beliefs would still be present, but the free parameters would update to try to account for them, at least on a conceptual level. Examples would include general relativity, quantum mechanics, and potentially even paraconsistent logic, for which some humans have tried to update their free parameters to account as much as possible for their hinge beliefs for th...
Any n-bit hash function will produce collisions once the number of elements in the hash table gets large enough (i.e. once the number of possible hashes that n bits can store has been exceeded), so adding new elements will require rehashing to avoid collisions, making GLUT have logarithmic time complexity in the limit. Meta-learning can also have constant time complexity for an arbitrarily large number of tasks, but not in the limit, assuming a finite neural network.
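To illustrate the pigeonhole point concretely, here is a small toy example with a 16-bit hash (the specific multiplicative hash is arbitrary): once more than 2^16 distinct keys are inserted, a collision is unavoidable.

```python
def hash_16bit(key: int) -> int:
    """Toy n-bit hash with n = 16: any function into 2**16 buckets will do."""
    return (key * 2654435761) & 0xFFFF

seen = {}
first_collision = None
for key in range(2**16 + 1):  # one more key than there are possible hash values
    h = hash_16bit(key)
    if h in seen:
        first_collision = (seen[h], key, h)
        break
    seen[h] = key

# By the pigeonhole principle a collision is guaranteed once more than 2**16
# keys are inserted; for this particular hash (a bijection mod 2**16) it
# happens exactly at key 2**16.
print(first_collision)
```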
Thanks Ben!
Great catch on the discrepancy between the podcast and this post—the description on the podcast is a variant we experimented with, but what is published here is self-contained and accurately describes what was done in the original paper.
On generalization and unlearning/forgetting: averaging activations across two instances of a network might give the impression of inducing "forgetting," but the goal is not exactly unlearning in the traditional sense. The method aims to align certain parts of the network's latent space related to self and other r...
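As a very rough sketch of what aligning only certain parts of the latent space could look like (illustrative only, not the exact procedure from the paper), one could penalise the self/other representational distance on a masked subset of hidden dimensions:

```python
import torch

def partial_overlap_loss(h_self: torch.Tensor, h_other: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Penalise self/other representational distance only on the masked subset
    of latent dimensions, leaving the rest free to preserve capabilities.

    h_self, h_other: hidden states from the 'self' and 'other' forward passes
                     (same layer, same shape)
    mask:            1.0 where the representations should overlap, 0.0 elsewhere
    """
    return torch.mean(mask * (h_self - h_other) ** 2)

# e.g. overlap only the first half of a 64-dimensional hidden state:
h_self, h_other = torch.randn(64), torch.randn(64)
mask = torch.cat([torch.ones(32), torch.zeros(32)])
print(partial_overlap_loss(h_self, h_other, mask))
```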