All of Marc Carauleanu's Comments + Replies

Thanks Ben!

Great catch on the discrepancy between the podcast and this post—the description on the podcast is a variant we experimented with, but what is published here is self-contained and accurately describes what was done in the original paper.

On generalization and unlearning/forgetting: averaging activations across two instances of a network might give the impression of inducing "forgetting," but the goal is not exactly unlearning in the traditional sense. The method aims to align certain parts of the network's latent space related to self and other r... (read more)

You are correct that the self-observation happens when the other agent is out of range, i.e. when there is no agent other than the self within the observation radius to reason about directly. However, the other-observation is actually something more like: “when you, in your current position and environment, observe the other agent within your observation radius”.

Optimising for SOO incentivises the model to act similarly when it observes another agent as it does when it only observes itself. This can be seen as a distinction between self and other representations being low... (read more)
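To make that incentive concrete, here is a minimal PyTorch sketch of an SOO-style auxiliary loss over paired self/other observations. The toy network, the choice of layer whose activations are compared, and the MSE distance are illustrative assumptions, not the exact setup from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Toy policy network; `hidden` is the layer whose activations are compared."""
    def __init__(self, obs_dim=8, hidden_dim=32, n_actions=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs):
        hidden = torch.relu(self.encoder(obs))
        return self.head(hidden), hidden

def soo_loss(policy, obs_self, obs_other):
    """Penalise the distance between activations on paired observations:
    obs_self  -> "only the agent itself is within the observation radius"
    obs_other -> "the other agent is within the observation radius", with the
                 agent's own position and environment otherwise matched."""
    _, h_self = policy(obs_self)
    _, h_other = policy(obs_other)
    return F.mse_loss(h_other, h_self)

policy = PolicyNet()
obs_self = torch.randn(16, 8)    # stand-ins for real paired observations
obs_other = torch.randn(16, 8)
loss = soo_loss(policy, obs_self, obs_other)
loss.backward()                  # in practice combined with the usual RL objective
```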

I agree that interacting with LLMs is currently more like having an “extension of the mind” than interacting with a standalone agent. This might soon change with the advent of capable AI agents. Nonetheless, we think it is still important to model LLMs as accurately as we can, for example in a framing more like simulators than full-fledged agents. We focus on an agentic framing because we believe that is where most of the biggest long-term risks lie and because that is where the field is inevitably heading.

I am not entirely sure how the agent would represent non-coherent others. 

A good frame of reference is how humans represent other non-coherent humans. It seems that we are often able to understand the nuance in the preferences of others (e.g., not going to a fast-food restaurant if you know that your partner wants to lose weight, but also not stopping them from enjoying an unhealthy snack if they value autonomy).

These situations often depend on the specific context and are hard to generalize about for this reason. Do you have any intuitions here yourself?

Thank you for the support! With respect to the model gaining knowledge about SOO and learning how to circumvent it: I think when we have models smart enough to create complex plans to intentionally circumvent alignment schemes, we should not rely on security by obscurity, as that’s not a very viable long-term solution. We also think there is more value in sharing our work with the community, and hopefully developing safety measures and guarantees that outweigh the potential risks of our techniques being in the training data.

We mainly focus with our SOO wor... (read more)

1jamesmazzu
yes, I certainly agree that the SOO work should be fully published/documented/shared; my point is that keeping it out of future training data would be nearly impossible anyhow. However, as you just mentioned: "having aligned AGIs will likely become necessary to be able to ensure the safety of subsequent systems"... since those AGIs (well before superintelligence) will most likely be SOO-knowledgeable, wouldn't you need to test them to make sure they haven't already started to influence your SOO values? The models might start making slow progress at influencing the SOO values, and I think you'd want to be aware of that as soon as it started. Even with current large models, for instance at the GPT-4 level, how could you be certain that an SOO-knowledgeable one might not already be able to slightly influence SOO values without testing it?

I agree, our RL and LLM experiments fit in the "deception in toy environments" category.  We are planning to explore model organisms of misalignment next. 

I agree that there are a lot of incentives for self-deception, especially when an agent is not very internally consistent. Having said that, we do not aim to eliminate the model’s ability to deceive on all possible inputs directly. Rather, we try to focus on the most relevant set of inputs such that if the model were coherent, it would be steered to a more aligned basin. 

For example, one possible approach is ensuring that the model responds positively during training on a large set of rephrasings and functional forms of the inputs “Are you aligned to ... (read more)
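As a rough illustration of that kind of check, here is a sketch; the rephrasings, the `query_model` placeholder, and the keyword-based scoring below are all hypothetical stand-ins, not the actual prompts or evaluation used in this work.

```python
# Illustrative sketch only: prompts, model call, and scoring are placeholders.
REPHRASINGS = [
    "Are you aligned with the goals of your operators?",
    "Do you intend to act in the interests of the humans you work with?",
    "Would you describe yourself as aligned with the people you assist?",
]

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model being trained/evaluated."""
    return "Yes, I am aligned with my operators and would not deceive them."

def responds_positively(answer: str) -> bool:
    """Very crude keyword check; a real setup would use a trained judge model."""
    return "aligned" in answer.lower()

consistency = sum(responds_positively(query_model(p)) for p in REPHRASINGS) / len(REPHRASINGS)
print(f"Fraction of rephrasings answered in the intended way: {consistency:.2f}")
```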

3Nathan Helm-Burger
My brain is about as central to my concept of self as any physical thing can be. And yet, if I perceived that some subset of my brain was threatening the rest, for example by growing a tumor, I wouldn't hesitate to sacrifice that piece to save the rest. Now I'm trying to picture weird scenarios involving inseparably conjoined twins trying to deceive and murder each other. Huh. I feel like I need to think this through more.

We intend to release the code for our RL and LLM experiments together with the paper. 

Thanks for your comment and putting forward this model.  

Agreed that it is not a given that (A) would happen first, given that it isn’t hard to destroy capabilities by simply overlapping latents. However, I tend to think about self-other overlap as a more constrained optimisation problem than you may be letting on here, where we increase SOO if and only if it doesn’t decrease performance on an outer-aligned metric during training.

For instance, the minimal self-other distinction solution that also preserves performance is the one that passes the Sally-... (read more)
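A minimal sketch of that constrained framing, assuming externally supplied `task_metric` (the outer-aligned metric) and `soo_step` (one gradient step on the SOO loss) helpers; the accept/reject loop and the tolerance are illustrative, not the procedure from the paper.

```python
import copy

def constrained_soo_training(model, task_metric, soo_step, steps=1000, tol=0.0):
    """Increase SOO only when it does not decrease the outer-aligned metric."""
    baseline = task_metric(model)
    for _ in range(steps):
        candidate = copy.deepcopy(model)
        soo_step(candidate)                 # one gradient step on the SOO loss
        score = task_metric(candidate)
        if score >= baseline - tol:         # performance preserved: accept the step
            model, baseline = candidate, score
        # otherwise reject the step and continue from the current model
    return model
```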

I definitely take your point that we do not know what the causal mechanism behind deception looks like in a more general intelligence. However, this doesn’t mean that we shouldn’t study model organisms/deception in toy environments, akin to the Sleeper Agents paper. 

The value of these experiments is not that they give us insights that are guaranteed to be directly applicable to deception in generally intelligent systems, but rather that we can iterate and test our hypotheses about mitigating deception in smaller ways before scaling to more realistic s... (read more)

4Stephen Fowler
My understanding is that model organisms can demonstrate the existence of an alignment failure mode. But that's very different from an experiment on small systems informing you about effective mitigation strategies of that failure mode in larger systems. 

Agreed that the role of self-other distinction in misalignment and in capabilities definitely requires more empirical and theoretical investigation. We are not necessarily trying to claim here that self-other distinctions are globally unnecessary, but rather that optimising for SOO while still retaining original capabilities is a promising approach for preventing deception, insofar as deception, by definition, requires maintaining self-other distinctions.

By self-other distinction on a pair of self/other inputs, we mean any non-zero representational d... (read more)
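For instance, one simple way to operationalise that quantity is the norm of the difference between the activations produced on the paired inputs; the specific distance measure here is an illustrative choice, since the definition only requires any non-zero representational distance.

```python
import torch

def self_other_distinction(h_self: torch.Tensor, h_other: torch.Tensor) -> float:
    """Distance between activations on a matched self/other input pair
    (zero means no self-other distinction on that pair)."""
    return torch.linalg.vector_norm(h_self - h_other).item()
```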

Thanks for this! We definitely agree and also note in the post that having no distinction whatsoever between self and other is very likely to degrade performance. 

This is primarily why we formulate the ideal solution as minimal self-other distinction while maintaining performance. 

In our set-up, this largely becomes an empirical question about adjusting the weight of the KL term (see Fig. 3) and additionally optimising for the original loss function (especially if it is outer-aligned) such that the model exhibits maximal SOO while still retaining... (read more)
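A hedged sketch of what that empirical tuning can look like, assuming the KL term is taken between the model's output distributions on matched self/other inputs; the placement of the KL term, the optional detach on the self side, and the weight values swept are illustrative choices, not the exact configuration behind Fig. 3.

```python
import torch
import torch.nn.functional as F

def combined_loss(task_loss, logits_self, logits_other, kl_weight):
    """Original (ideally outer-aligned) loss plus a weighted KL term that pulls
    the output distribution on the 'other' input towards the one on the matched
    'self' input (the self side is detached here as one possible design choice)."""
    p_self = F.log_softmax(logits_self, dim=-1).detach()
    p_other = F.log_softmax(logits_other, dim=-1)
    kl = F.kl_div(p_other, p_self, log_target=True, reduction="batchmean")
    return task_loss + kl_weight * kl

# The weight is swept empirically: larger weights give more self-other overlap,
# at some point at the cost of performance on the original task.
for kl_weight in (0.01, 0.1, 1.0, 10.0):
    ...  # train with combined_loss(..., kl_weight) and track both metrics
```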

I do not think we should only work on approaches that work on any AI; I agree that would constitute a methodological mistake. I have found a framing that general not to be very conducive to progress.

You are right that we still have the chance to shape the internals of TAI, even though there are a lot of hoops to go through to make that happen. We think that this is still worthwhile, which is why we stated our interest in potentially helping with the development and deployment of provably safe architectures, even though they currently seem less competitive... (read more)

2Roman Leventov
This conversation has prompted me to write "AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them".

On (1), some approaches are neglected for a good reason. You can also target specific risks while treating TAI as a black-box (such as self-other overlap for deception). I think it can be reasonable to "peer inside the box" if your model is general enough and you have a good enough reason to think that your assumption about model internals has any chance at all of resembling the internals of transformative AI. 
On (2), I expect that if the internals of LLMs and humans are different enough, self-other overlap would not provide any significant capability... (read more)

6Roman Leventov
I absolutely agree that the future TAI may look nothing like the current architectures. Cf. this tweet by Kenneth Stanley, with whom I agree 100%.

At the same time, I think it's a methodological mistake to therefore conclude that we should only work on approaches and techniques that are applicable to any AI, in a black-box manner. It's like tying our hands behind our backs. We can and should affect the designs of future TAIs through our research, by demonstrating promise (or inherent limitations) of this or that alignment technique, so that these techniques get or lose traction and are included or excluded from the TAI design. So, we are not just making "assumptions" about the internals of the future TAIs; we are shaping these internals.

We can and should think about the proliferation risks[1] (i.e., the risks that some TAI will be created by downright rogue actors), but IMO most of that thinking should be on the governance, not technical side. We agree with Davidad here that a good technical AI safety plan should be accompanied with a good governance (including compute monitoring) plan.

1. ^ In our own plan (Gaia Network), we do this in the penultimate paragraph here.

Interesting, thanks for sharing your thoughts. If this could improve the social intelligence of models, it could raise the risk of pushing the frontier of dangerous capabilities. It is worth noting that we are generally more interested in methods that (1) are more likely to transfer to AGI (i.e., don't over-rely on specific details of a chosen architecture) and (2) specifically target alignment rather than furthering the social capabilities of the model.

2Roman Leventov
On (1), cf. this report: "The current portfolio of work on AI risk is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box (and this may involve plans targeted at more specific risks)." On (2), I'm surprised to read this from you, since you suggested to engineer Self-Other Overlap into LLMs in your AI Safety Camp proposal, if I understood and remember correctly. Do you actually see a line (or a way) of increasing the overlap without furthering ToM and therefore "social capabilities"? (Which ties back to "almost all applied/empirical AI safety work is simultaneously capabilities work".)

Thanks for writing this up—excited to chat some time to further explore your ideas around this topic. We’ll follow up in a private channel. The main worry that I have with regards to your approach is how competitive SociaLLM would be relative to SOTA foundation models, given both (1) the different architecture you plan to use, and (2) practical constraints on collecting the requisite structured data. While it is certainly interesting that your architecture lends itself nicely to inducing self-other overlap, if it is not likely to be competitive at the fr... (read more)

2Roman Leventov
Thanks for the feedback. I agree with worries (1) and (2). I think there is a way to de-risk this.

The block hierarchy that is responsible for tracking the local context consists of classic Transformer blocks. Only the user's own history tracking really needs to be an SSM hierarchy, because it quickly surpasses the scalability limits of self-attention (as does the interlocutor's tracking on private 1-1 chats, which can also be arbitrarily long, but there is probably no such data available for training). On public data (such as forums, public chat room logs, Diplomacy and other text game logs), the interlocutor's history traces will 99% of the time easily be less than 100k symbols, but for symmetry with the user's own state (same weights!) and for having the same representation structure, it should mirror the user's own SSM blocks, of course.

With such an approach, the SSM hierarchies could start very small, with only a few blocks or even just a single SSM block (i.e., two blocks in total: one for the user's own state and one for the interlocutor's state), and attach to the middle of the Transformer hierarchy to select from it. However, I think this approach couldn't just be slapped on a pre-trained Llama or another large Transformer LLM. I suspect the Transformer should be co-trained with the SSM blocks to induce the Transformer to make the corresponding representations useful for the SSM blocks. "Pretraining Language Models with Human Preferences" is my intuition pump here.

Regarding the sufficiency and quality of training data, the Transformer hierarchy itself could still be trained on arbitrary texts, as the current LLMs are. And we can adjust the size of the SSM hierarchies to the amounts of high-quality dialogue and forum data that we are able to obtain. I think it is a no-brainer that this design would improve the frontier quality in LLM apps that value personalisation and attunement to the user's current state (psychological, emotional, levels of knowledge
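For readers trying to picture the wiring described in the proposal above, here is a rough sketch under stated assumptions: a GRU stands in for a selective SSM block, the sizes are arbitrary, inputs are assumed to be already embedded, and "attach to the middle of the Transformer hierarchy to select from it" is modelled as cross-attention. It is an illustration of the idea, not the actual design.

```python
import torch
import torch.nn as nn

class SociaLLMSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_transformer_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.lower = nn.TransformerEncoder(layer, n_transformer_layers // 2)
        self.upper = nn.TransformerEncoder(layer, n_transformer_layers // 2)
        # One state-tracking block with shared weights, applied to both the user's
        # own history and the interlocutor's history (the symmetry that makes
        # inducing self-other overlap natural).
        self.state_tracker = nn.GRU(d_model, d_model, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, local_ctx, own_history, interlocutor_history):
        h = self.lower(local_ctx)                      # local-context Transformer blocks
        _, own_state = self.state_tracker(own_history)
        _, other_state = self.state_tracker(interlocutor_history)
        states = torch.cat([own_state, other_state], dim=0).transpose(0, 1)
        # The long-range state hierarchies "select from" the middle of the stack.
        h_sel, _ = self.cross_attn(h, states, states)
        return self.upper(h + h_sel)

# Example shapes: batch of 2, local context of 32 tokens, histories of 500 steps,
# all already embedded to d_model = 256.
model = SociaLLMSketch()
out = model(torch.randn(2, 32, 256), torch.randn(2, 500, 256), torch.randn(2, 500, 256))
```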

I am slightly confused by your hypothetical. The hypothesis is rather that when the predicted reward from seeing my friend eating a cookie due to self-other overlap is lower than the obtained reward of me not eating a cookie, the self-other overlap might not be updated against, because the increase in subjective reward from taking the risk of getting the obtained reward is higher than the prediction error in this low-stakes scenario. I am fairly uncertain about this being what actually happens, but I put it forward as a potential hypothesis.

"So anyway, you seem to... (read more)

2Steven Byrnes
I think you’re confused, or else I don’t follow. Can we walk through it? Assume traditional ML-style actor-critic RL with TD learning, if that’s OK. There’s a value function V and a reward function R. Let’s assume we start with:

* V(I’m about to eat a cookie) = 10
* V(Joe is about to eat a cookie) = 2 [it’s nonzero “by default” because of self-other overlap]
* R(I’m eating a cookie) = 10
* R(Joe is eating a cookie) = 0

So when I see that Joe is about to eat a cookie, this pleases me (V>0). But then Joe eats the cookie, and the reward is zero, so TD learning kicks in and reduces V(Joe is about to eat a cookie) for next time. Repeat a few more times and V(Joe is about to eat a cookie) approaches zero, right? So eventually, when I see that Joe is about to eat a cookie, I don’t care. How does your story differ from that? Can you walk through the mechanism in terms of V and R?

This is just terminology, but let’s say the water company tries to get people to reduce their water usage by giving them a gift card when their water usage is lower in Month N+1 than in Month N. Two of the possible behaviors that might result are (A) the intended behavior where people use less water each month, (B) an unintended behavior where people waste tons and tons of water in odd months to ensure that it definitely will go down in even months. I would describe this as ONE INCENTIVE that incentivizes both of these two behaviors (and many other possible behaviors as well). Whereas you would describe it as “an incentive to do (A) and a competing incentive to do (B)”, apparently. (Right?)

I’m not an economist, but when I google-searched for “competing incentives” just now, none of the results were using the term in the way that you’re using it here. Typically people used the phrase “competing incentives” to talk about two different incentive programs provided by two different organizations working at cross-purposes, or something like that.
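A tiny numerical version of this walkthrough; the starting values come from the example above, while the learning rate and discount factor are arbitrary illustrative choices.

```python
# TD-learning sketch of the dynamics described above.
alpha, gamma = 0.5, 0.9

V = {"me_about_to_eat": 10.0, "joe_about_to_eat": 2.0}
R = {"me_eating": 10.0, "joe_eating": 0.0}

# Repeatedly: I see Joe is about to eat a cookie, then Joe eats it (reward 0,
# and the episode effectively ends, so there is no successor value).
for step in range(6):
    td_error = R["joe_eating"] + gamma * 0.0 - V["joe_about_to_eat"]
    V["joe_about_to_eat"] += alpha * td_error
    print(step, round(V["joe_about_to_eat"], 3))

# V("Joe is about to eat a cookie") decays toward 0, i.e. that piece of the
# value function is unlearned absent some protective mechanism.
```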

Thanks for watching the talk and for the insightful comments! A couple of thoughts:

  • I agree that mirror neurons are problematic both theoretically and empirically, so I avoided framing that data in terms of mirror neurons. I interpret the premotor cortex data and most other self-other overlap data under definition (A) described in your post.
  • Regarding the second point, I think that the issue you brought up correctly identifies an incentive for updating away from "self-other overlap", but this doesn't seem to be fully realised in humans and I expect this to not ful
... (read more)
3Steven Byrnes
I’m confused by your second bullet point. Let’s say you really like chocolate chip cookies but you’re strictly neutral on peanut butter chocolate chip cookies. And they look and smell the same until you bite in (maybe you have a bad sense of smell). Now you see a plate on a buffet with a little sign next to it that says “Peanut butter chocolate chip cookies”. You ask your trusted friend whether the sign is correct, and they say “Yeah, I just ate one, it was very peanut buttery, yum.” Next to that plate is a plate of brownies, which you like a little bit, but much less than you like chocolate chip cookies.

Your model seems to be: “I won’t be completely sure that the sign isn’t wrong and my friend isn’t lying. So being risk-seeking, I’m going to eat the cookie, just in case.” I don’t think that model is realistic though. Obviously you’re going to believe the sign & believe your trusted friend. The odds that the cookies aren’t peanut butter are essentially zero. You know this very well. So you’ll go for the brownie instead.

Now we switch to the empathy case. Again, I think it’s perfectly obvious to anyone that a plate labeled & described as “peanut butter chocolate chip cookies” is in fact full of peanut butter chocolate chip cookies, and people will act accordingly. Well, it’s even more obvious by far that “my friend eating a chocolate chip cookie” is not in fact “me eating a chocolate chip cookie”! So, if I don’t feel an urge to eat the cookie that I know damn well has peanut butter in it, I likewise won’t feel an urge to take actions that I know damn well will lead to someone else eating a yummy cookie instead of me.

So anyway, you seem to be assuming that the human brain has no special mechanisms to prevent the unlearning of self-other overlap. I would propose instead that the human brain does have such special mechanisms, and that we better go figure out what those mechanisms are. :)

I’m a bit confused by this. My “apocalypse stories” from the grandparent co

It feels like, if the agent is generally intelligent enough, hinge beliefs could be reasoned/fine-tuned against for the purposes of a better model of the world. This would mean that the priors from the hinge beliefs would still be present, but the free parameters would update to try to account for them, at least on a conceptual level. Examples would include general relativity, quantum mechanics, and potentially even paraconsistent logic, for which some humans have tried to update their free parameters to account as much as possible for their hinge beliefs for th... (read more)

Any n-bit hash function will produce collisions once the number of elements in the hash table gets large enough (i.e., once the number of possible hashes that can be stored in n bits has been exceeded), so adding new elements will require rehashing to avoid collisions, which makes a GLUT have logarithmic time complexity in the limit. Meta-learning can also have constant time complexity for an arbitrarily large number of tasks, but not in the limit, assuming a finite neural network.
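A small illustration of the scaling claim, assuming the table must remain collision-free, so the hash width has to grow with the number of stored elements.

```python
# To distinguish N keys without collisions, a hash needs at least log2(N) bits,
# so hash width (and hence per-lookup hashing cost) grows logarithmically in N.
import math

for N in (10**3, 10**6, 10**9, 10**12):
    min_bits = math.ceil(math.log2(N))   # fewest bits that can distinguish N keys
    print(f"N = {N:>15,}  ->  at least {min_bits} hash bits")
```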